Researchers Deanonymize Reddit and Hacker News Users at Scale

2 months ago

ETH Zurich study shows LLMs can deanonymize pseudonymous accounts with 68% recall at 90% precision.
The attack uses a four-stage pipeline: Extract identity signals → Search candidates via embeddings → Reason over the candidates → Calibrate confidence scores.
Results: Hacker News → LinkedIn (45.1% recall at 99% precision), Reddit movie communities (2.8% recall at 99% precision), Temporal Reddit splits (38.4% recall at 99% precision).
Fully autonomous agents correctly identified 67% of users at 90% precision, costing $1-4 per deanonymization.
Classical deanonymization methods achieved far lower recall (0-0.2%).
The study concludes that pseudonymity no longer offers practical protection: persistent usernames can be linked to real identities.
More shared content makes users easier to identify: recall was 48% for users sharing 10+ movies vs. 3% for those sharing one.
Platforms should rate-limit API access, restrict bulk data exports, and consider privacy costs of public scrapable data.
Researchers and activists should compartmentalize identities and assume LLM-powered deanonymization is a threat.
LLMs excel at extracting unstructured signals, semantic search, and reasoning, reducing deanonymization costs.
Threatened groups include whistleblowers, activists, abuse survivors, and others relying on anonymity.
Mitigations like k-anonymity and differential privacy are ineffective for text anonymization.
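
The "recall at X% precision" figures quoted above can be read as follows: the system may abstain on low-confidence guesses, precision is the share of the guesses it does make that are correct, and recall is the share of all targets it identifies correctly. The sketch below is a minimal illustration of that metric on made-up data, not the study's evaluation code. The function name `recall_at_precision` and the `(confidence, is_correct)` input format are assumptions chosen for this example.

```python
from typing import List, Tuple


def recall_at_precision(scored: List[Tuple[float, bool]],
                        target_precision: float) -> float:
    """Best recall reachable while keeping precision >= target_precision.

    `scored` holds one (confidence, is_correct) pair per target user.
    Guesses below the chosen confidence threshold count as abstentions.
    (Hypothetical helper for illustration only.)
    """
    total = len(scored)
    if total == 0:
        return 0.0

    # Sweep thresholds from the most to the least confident guess.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    correct = 0
    for made, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Guesses with equal confidence cannot be separated by a threshold,
        # so only evaluate once the whole tie group has been included.
        if made < total and ranked[made][0] == confidence:
            continue
        precision = correct / made
        if precision >= target_precision:
            best_recall = max(best_recall, correct / total)
    return best_recall


# Made-up scores for ten targets, purely to exercise the function.
toy = [(0.99, True), (0.95, True), (0.90, True), (0.80, False), (0.70, True),
       (0.60, True), (0.50, False), (0.40, True), (0.30, False), (0.20, False)]

print(recall_at_precision(toy, 0.90))
print(recall_at_precision(toy, 0.80))
```

Raising the precision target forces the system to abstain more often, which is why the recall reported at 99% precision is lower than the recall reported at 90% precision.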
