Re-Identification Risk vs. K-Anonymity

5 hours ago


The post explores the trade-off between k-anonymity and re-identification risk through an experimental walkthrough.
A synthetic dataset of 2,000 individuals was created with age, zip3, and sex as quasi-identifiers (QIs) and lab_glucose as a non-QI attribute.
The k parameter was varied from 1 to 20, with anonymization techniques including age binning, top-coding, and rare ZIP suppression.
A simulated attacker with partial knowledge (age, zip3, sex) attempts re-identification using a global optimization strategy.
Re-identification success (Hit@1) drops sharply as k increases, especially beyond k=5–7, indicating stronger anonymity.
Data utility (e.g., ZIP Utility, Mean Age Drift) degrades non-linearly with higher k, with significant loss beyond k=8.
The privacy–utility frontier shows diminishing returns: high k yields minimal extra privacy at major utility cost.
Non-QI attributes (e.g., lab_glucose) remain unchanged, preserving analytical value while QIs are anonymized.
The attacker model is basic; real-world risks could be higher with additional knowledge or advanced techniques.
Key takeaway: Moderate k (e.g., 5–7) offers strong privacy with reasonable utility, while higher k may render data unusable.
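
The experiment summarized above can be approximated with a minimal sketch. Everything in it is an assumption made for illustration rather than the post's actual code: the ZIP prefixes and their weights, the 5-year age bins, the top-code at 80, and the helper names (make_records, age_bin, anonymize, expected_hit_at_1) are all hypothetical. The attacker here is also simpler than the global optimization strategy the post describes: it knows the target's age, zip3, and sex, and guesses uniformly among the released records that match, so its expected Hit@1 is the average of 1 / (equivalence class size).

```python
import random
from collections import Counter

QI = ("age_bin", "zip3", "sex")  # quasi-identifiers after generalization


def make_records(n=2000, seed=0):
    """Synthetic individuals; the ZIP prefixes and weights are made up."""
    rng = random.Random(seed)
    zips = ["100", "101", "606", "941", "945", "997"]
    weights = [30, 25, 20, 15, 8, 2]  # a few deliberately rare ZIPs
    return [
        {
            "age": rng.randint(18, 90),
            "zip3": rng.choices(zips, weights=weights)[0],
            "sex": rng.choice(["F", "M"]),
            "lab_glucose": round(rng.gauss(100, 15), 1),  # non-QI attribute
        }
        for _ in range(n)
    ]


def age_bin(age, width=5, top=80):
    """Age binning with top-coding: ages >= top collapse into one bin."""
    if age >= top:
        return f"{top}+"
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"


def key(row):
    return tuple(row[q] for q in QI)


def anonymize(records, k, width=5):
    """Generalize age, suppress ZIPs in small classes, drop leftovers."""
    rows = [
        {"age_bin": age_bin(r["age"], width), "zip3": r["zip3"],
         "sex": r["sex"], "lab_glucose": r["lab_glucose"]}
        for r in records
    ]
    sizes = Counter(key(r) for r in rows)
    for r in rows:
        if sizes[key(r)] < k:
            r["zip3"] = "***"  # rare ZIP suppression
    sizes = Counter(key(r) for r in rows)
    # Records still in a class smaller than k are withheld entirely.
    return [r for r in rows if sizes[key(r)] >= k]


def expected_hit_at_1(released):
    """Uniform-guess attacker: success chance is 1 / class size per target."""
    if not released:
        return 0.0
    sizes = Counter(key(r) for r in released)
    return sum(1 / sizes[key(r)] for r in released) / len(released)


records = make_records()
for k in (1, 2, 5, 10, 20):
    released = anonymize(records, k)
    suppressed = sum(r["zip3"] == "***" for r in released)
    print(k, len(released), suppressed, round(expected_hit_at_1(released), 4))
```

Under this attacker model the expected Hit@1 can never exceed 1/k, because every released record shares its QI combination with at least k - 1 others. The count of suppressed ZIPs is only a rough stand-in for the utility metrics the post reports, and lab_glucose passes through untouched, as in the original setup.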
