Hasty Briefs (beta)

Re-Identification Risk vs. K-Anonymity

5 hours ago
  • #data-privacy
  • #k-anonymity
  • #re-identification
  • The post explores the trade-off between k-anonymity and re-identification risk through an experimental walkthrough.
  • A synthetic dataset of 2000 individuals was created with age, zip3, and sex as quasi-identifiers (QIs), plus lab_glucose as a non-QI attribute.
  • The k-anonymity parameter k was varied from 1 to 20, with anonymization techniques including age binning, age top-coding, and rare ZIP suppression.
  • A simulated attacker with partial knowledge (age, zip3, sex) attempts re-identification using a global optimization strategy.
  • Re-identification success (Hit@1) drops sharply as k increases, especially beyond k=5–7, indicating stronger anonymity.
  • Data utility (e.g., ZIP Utility, Mean Age Drift) degrades non-linearly with higher k, with significant loss beyond k=8.
  • The privacy–utility frontier shows diminishing returns: high k yields minimal extra privacy at major utility cost.
  • Non-QI attributes (e.g., lab_glucose) remain unchanged, preserving analytical value while QIs are anonymized.
  • The attacker model is basic; real-world risks could be higher with additional knowledge or advanced techniques.
  • Key takeaway: Moderate k (e.g., 5–7) offers strong privacy with reasonable utility, while higher k may render data unusable.
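
The experiment summarized above can be approximated with a short Python sketch. It is an illustration only, not the post's code: the function names, ZIP prefixes, bin widths, and thresholds are all assumptions. It also swaps in a much simpler attacker than the post's global optimization strategy. Here the attacker knows a target's age, zip3, and sex and guesses uniformly at random within the matching equivalence class, so the expected Hit@1 equals the number of classes divided by the number of records.

```python
import random
from collections import Counter

QI = ("age", "zip3", "sex")  # quasi-identifiers; lab_glucose is not a QI


def make_records(n=2000, seed=0):
    """Build a synthetic dataset (the distributions here are hypothetical)."""
    rng = random.Random(seed)
    zips = ["021", "100", "606", "941", "733"]
    weights = [40, 30, 20, 9, 1]  # the last prefix is deliberately rare
    return [
        {
            "age": rng.randint(18, 95),
            "zip3": rng.choices(zips, weights=weights)[0],
            "sex": rng.choice("FM"),
            "lab_glucose": round(rng.gauss(100, 15), 1),
        }
        for _ in range(n)
    ]


def anonymize(records, bin_width=10, top_code=80, min_zip_count=50):
    """Apply age top-coding, age binning, and rare ZIP suppression."""
    zip_counts = Counter(r["zip3"] for r in records)
    out = []
    for r in records:
        age = min(r["age"], top_code)  # top-coding
        out.append({
            "age": (age // bin_width) * bin_width,  # lower edge of the age bin
            "zip3": r["zip3"] if zip_counts[r["zip3"]] >= min_zip_count else "*",
            "sex": r["sex"],
            "lab_glucose": r["lab_glucose"],  # non-QI value is left unchanged
        })
    return out


def class_sizes(records):
    """Count records per equivalence class (records sharing all QI values)."""
    return Counter(tuple(r[q] for q in QI) for r in records)


def achieved_k(records):
    """The k that the data satisfies: the size of the smallest class."""
    return min(class_sizes(records).values())


def expected_hit_at_1(records):
    """Expected Hit@1 for a uniform guess within the target's class.

    Each record is found with probability 1 / class size, so the average
    over all records is (number of classes) / (number of records).
    """
    return len(class_sizes(records)) / len(records)


if __name__ == "__main__":
    raw = make_records()
    for width in (1, 5, 10, 20):
        anon = anonymize(raw, bin_width=width)
        print(width, achieved_k(anon), round(expected_hit_at_1(anon), 3))
```

Because every generalized value is a function of the original one, coarser bins can only merge equivalence classes. The achieved k therefore never decreases and this simplified Hit@1 never increases as bin_width grows. A stronger attacker, such as the one described in the post, could do better than this baseline.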