Re-Identification Risk vs. K-Anonymity
5 hours ago
- #data-privacy
- #k-anonymity
- #re-identification
- The post explores the trade-off between k-anonymity and re-identification risk through an experimental walkthrough.
- A synthetic dataset of 2000 individuals was created with the quasi-identifiers (QIs) age, zip3, and sex, plus non-QI attributes such as lab_glucose.
- The target k was varied from 1 to 20, using age binning, top-coding, and suppression of rare ZIP codes as the anonymization techniques.
- A simulated attacker with partial knowledge (age, zip3, sex) attempts re-identification using a global optimization strategy.
- Re-identification success (Hit@1, the fraction of targets for which the attacker's top-ranked guess is correct) drops sharply as k increases, especially beyond k=5–7, indicating stronger anonymity.
- Data utility (e.g., ZIP Utility, Mean Age Drift) degrades non-linearly with higher k, with significant loss beyond k=8.
- The privacy–utility frontier shows diminishing returns: high k yields minimal extra privacy at major utility cost.
- Non-QI attributes (e.g., lab_glucose) remain unchanged, preserving analytical value while QIs are anonymized.
- The attacker model is basic; real-world risks could be higher with additional knowledge or advanced techniques.
- Key takeaway: Moderate k (e.g., 5–7) offers strong privacy with reasonable utility, while higher k may render data unusable.
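
The pipeline summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the post's code: the function names, the ZIP distribution, the bin width, the top-code threshold, and the rarity cutoff are all assumptions, and the attacker here simply guesses uniformly within the matching equivalence class rather than using the global optimization strategy the post describes.

```python
import random
from collections import Counter

QIS = ("age", "zip3", "sex")  # quasi-identifiers; lab_glucose is left untouched


def make_records(n=2000, seed=0):
    """Synthetic population (hypothetical distributions, for illustration only)."""
    rng = random.Random(seed)
    zips = [f"{z:03d}" for z in range(100, 130)]
    weights = [1.0 / (i + 1) for i in range(len(zips))]  # skewed, so some ZIPs are rare
    return [
        {
            "id": i,
            "age": rng.randint(18, 95),
            "zip3": rng.choices(zips, weights=weights)[0],
            "sex": rng.choice("FM"),
            "lab_glucose": round(rng.gauss(100, 15), 1),
        }
        for i in range(n)
    ]


def generalize(records, bin_width=10, top_code=80, min_zip_count=50):
    """Age binning, top-coding, and rare-ZIP suppression (assumed parameters)."""
    zip_counts = Counter(r["zip3"] for r in records)
    out = []
    for r in records:
        if r["age"] >= top_code:
            age_label = f"{top_code}+"  # top-coding: collapse the oldest ages
        else:
            lo = (r["age"] // bin_width) * bin_width
            hi = min(lo + bin_width - 1, top_code - 1)
            age_label = f"{lo}-{hi}"  # binning: replace exact age with a range
        rare = zip_counts[r["zip3"]] < min_zip_count
        out.append({
            "id": r["id"],
            "age": age_label,
            "zip3": "***" if rare else r["zip3"],  # suppress rare ZIP prefixes
            "sex": r["sex"],
            "lab_glucose": r["lab_glucose"],  # non-QI attribute passes through
        })
    return out


def class_sizes(records):
    """Size of each equivalence class, i.e. each distinct QI combination."""
    return Counter(tuple(r[q] for q in QIS) for r in records)


def achieved_k(records):
    """The table is k-anonymous for k equal to the smallest class size."""
    return min(class_sizes(records).values())


def expected_hit_at_1(records):
    """Expected Hit@1 for an attacker who knows a target's QIs and guesses
    uniformly within the matching class: the mean of 1/class_size over all
    records, which simplifies to (number of classes) / (number of records)."""
    return len(class_sizes(records)) / len(records)


raw = make_records()
anon = generalize(raw)
print("achieved k:", achieved_k(raw), "->", achieved_k(anon))
print("expected Hit@1:", expected_hit_at_1(raw), "->", expected_hit_at_1(anon))
```

Sweeping `bin_width` and `min_zip_count` over a range of values and recording `achieved_k` alongside a utility measure of your choice is one way to trace a privacy–utility frontier like the one the post reports.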