Preprint
Article

This version is not peer-reviewed.

(d, c, l)-Privacy: Privacy Preservation Models for Content-Sensitive Datasets Using Information Retrieval Techniques

Submitted: 23 April 2026

Posted: 23 April 2026


Abstract
Both data utility and data privacy must be carefully considered when datasets containing users’ sensitive information are released for use beyond the scope of the data-collecting organization. To achieve an appropriate balance between data utility and data privacy in such released datasets, a variety of privacy preservation models have been proposed, including k-Anonymity, l-Diversity, Anatomy, t-Closeness, and Differential Privacy. Unfortunately, these privacy preservation models can only sufficiently address privacy violation concerns in simple datasets. Moreover, to the best of our knowledge, they still have various vulnerabilities that must be addressed, such as reduced data utility, high complexity, and susceptibility to privacy violation techniques discovered after they were proposed. Furthermore, we found that they are not well suited to addressing privacy violation concerns in datasets containing content-sensitive values (sometimes called content-based datasets). To eliminate these vulnerabilities, a new privacy preservation model, called (d, c, l)-Privacy, is proposed in this work to address privacy violation concerns in content-sensitive datasets. It is based on term–document measurements obtained from experts and from information retrieval mechanisms; that is, concerns about privacy violations in released datasets are reduced once the datasets satisfy the d, c, and l parameters. To achieve the (d, c, l)-Privacy constraints, three algorithms are proposed in this work: the FCFS, greedy, and optimal (d, c, l)-Privacy algorithms. The FCFS (d, c, l)-Privacy algorithm aims to satisfy the d, c, and l parameters while keeping the execution time as low as possible. The greedy (d, c, l)-Privacy algorithm balances execution time against data utility. The optimal (d, c, l)-Privacy algorithm aims to preserve the meaning and usefulness of the data as much as possible. The experimental results show that the proposed algorithms are effective in mitigating privacy breaches in released datasets under the (d, c, l)-Privacy constraints. Among the evaluated algorithms, FCFS achieves the highest time efficiency, whereas the greedy algorithm offers a better balance by preserving data semantics at a reasonable computational cost. The optimal algorithm consistently maintains the highest level of data utility, albeit at increased computational expense. These findings indicate that the proposed algorithms are not only effective in preserving the privacy of released datasets but are also suitable for practical deployment in real-world data-releasing scenarios.
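As a rough illustration of the kind of term–document measurement and parameter check described above, the following Python sketch scores sensitive terms with a standard TF-IDF weighting and tests a released corpus against three illustrative thresholds named d, c, and l. The thresholds, the scoring, and the helper names (tf_idf, satisfies_dcl) are assumptions made for this example only; they are not the paper's actual definitions of the (d, c, l)-Privacy constraints.

import math
from collections import Counter

def tf_idf(term, doc, corpus):
    # Term frequency of the term within one tokenized document.
    tf = Counter(doc)[term] / max(len(doc), 1)
    # Inverse document frequency over the whole released corpus.
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

def satisfies_dcl(corpus, sensitive_terms, d=0.5, c=3, l=2):
    # Illustrative thresholds (assumed, not the paper's definitions):
    #   d: maximum TF-IDF weight allowed for a sensitive term in any document
    #   c: maximum number of documents a sensitive term may appear in
    #   l: minimum number of distinct sensitive terms in any document that
    #      contains at least one sensitive term (a diversity-style condition)
    for term in sensitive_terms:
        if sum(1 for doc in corpus if term in doc) > c:
            return False
        if any(tf_idf(term, doc, corpus) > d for doc in corpus):
            return False
    return all(
        len(set(doc) & sensitive_terms) >= l
        for doc in corpus
        if set(doc) & sensitive_terms
    )

# Example: three tokenized records of a content-based dataset.
corpus = [["hiv", "clinic", "visit"], ["flu", "clinic"], ["hiv", "cancer", "scan"]]
print(satisfies_dcl(corpus, sensitive_terms={"hiv", "cancer", "flu"}))

A real anonymization procedure would repeat such a check while suppressing or generalizing terms until the released dataset passes, which is roughly the role the FCFS, greedy, and optimal algorithms play in this work.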
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.