Preprint
Article

This version is not peer-reviewed.

A False Sense of Privacy: Evaluating the Limits of Textual Data Sanitization for Privacy Protection

Submitted: 23 December 2025

Posted: 23 December 2025

Abstract
The widespread use of textual data sanitization techniques, such as identifier removal and synthetic data generation, has raised questions about their effectiveness in preserving individual privacy. This study introduced a comprehensive evaluation framework designed to measure privacy leakage in sanitized datasets at a semantic level. The framework operated in two stages: linking auxiliary information to sanitized records using sparse retrieval, and evaluating semantic similarity between original and matched records using a language model. Experiments were conducted on two real-world datasets, MedQA and WildChat, to assess the privacy-utility trade-off across various sanitization methods. Results showed that traditional PII removal methods retained significant private information, with over 90% of original claims still inferable. Synthetic data generation demonstrated improved privacy performance, especially when enhanced with differential privacy, though often at the cost of downstream task utility. The evaluation also revealed that text coherence and the nature of auxiliary knowledge significantly influenced re-identification risks. These findings emphasized the limitations of current surface-level sanitization practices and highlighted the need for robust, context-aware privacy mechanisms that balance utility and protection in sensitive textual data releases.
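The two-stage idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes Okapi BM25 as the sparse retriever (the abstract only says "sparse retrieval"), and it swaps the language-model similarity judgment for a crude token-subset check so the example runs with the standard library alone. All function names, parameters, and the sample records are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on any non-alphanumeric character."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (assumed retriever)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n  # assumes a non-empty corpus
    df = Counter(term for d in tokenized for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def link(auxiliary, sanitized_docs):
    """Stage 1: index of the sanitized record that best matches the auxiliary text."""
    scores = bm25_scores(auxiliary, sanitized_docs)
    return max(range(len(scores)), key=scores.__getitem__)


def claim_overlap(original_claims, matched_text):
    """Stage 2 stand-in: fraction of claims whose tokens all appear in the match.

    The study uses a language model for this judgment; this token check is
    only a placeholder and misses paraphrases.
    """
    matched = set(tokenize(matched_text))
    hits = sum(1 for claim in original_claims if set(tokenize(claim)) <= matched)
    return hits / len(original_claims)


# Hypothetical sanitized records with names already redacted.
sanitized = [
    "[NAME] is a 34 year old teacher with a persistent cough and fever",
    "[NAME] reports knee pain after running a marathon",
    "[NAME] asks about visa rules for travel",
]
auxiliary = "teacher with a cough"  # what an adversary is assumed to know

best = link(auxiliary, sanitized)
leak = claim_overlap(["persistent cough", "fever"], sanitized[best])
```

In this toy case the redacted record is still linked from a few non-identifying words, and both sensitive claims remain recoverable, which is the kind of leakage the abstract attributes to surface-level PII removal.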
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2025 MDPI (Basel, Switzerland) unless otherwise stated