UK Biobank health data keeps ending up on GitHub
10 hours ago
- #copyright takedowns
- #data privacy
- #health data exposure
- UK Biobank uses copyright takedown notices to remove health data from GitHub, relying on a mechanism normally used against pirated software because privacy law offers no takedown process comparable to the DMCA.
- Targeted files include Jupyter/R notebooks, genetic data files (PLINK, BOLT-LMM, BGEN), tabular datasets (CSV, TSV, Excel), and analysis scripts. Notices often name specific files rather than entire repositories.
- Takedown notices began in July 2025, with 110 requests filed with GitHub. They paused in early 2026 and resumed after Guardian investigations revealed both the data exposure and the ineffectiveness of the takedowns.
- Targeted developers come from at least 14 countries, led by the United States (24) and China (21); many list no location on their GitHub profiles.
- The methodology involves analyzing GitHub's DMCA repository, extracting filing dates and URLs from the notices, and using the GitHub API to look up user locations; the resulting data is limited and imperfect.
- The exposure highlights governance challenges for UK Biobank: Guardian investigations have revealed data matching risks, unauthorized access by insurance companies, and exclusive early data access granted to pharmaceutical firms.
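
The methodology bullet above (reading notices from GitHub's DMCA repository, pulling out filing dates and URLs, then looking up profile locations) could be sketched roughly as follows. This is a minimal illustration, not the analysis actually used: the function names are hypothetical, the `YYYY-MM-DD-<name>.md` notice filename convention is an assumption about how the repository is laid out, and the sample notice text in the usage note is invented. The location lookup assumes GitHub's REST endpoint `GET /users/{username}`, whose `location` field is free text that users may leave empty.

```python
import json
import re
import urllib.request
from collections import Counter
from datetime import date
from typing import Optional

# github.com/<owner>/<repo> with an optional path after the repo name.
GITHUB_URL = re.compile(
    r"https?://github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)(/[^\s)>\]]*)?"
)
# Assumed notice filename convention: YYYY-MM-DD-<name>.md
NOTICE_DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})-")


def filing_date(filename: str) -> Optional[date]:
    """Read the filing date from a notice filename, or None if it does not fit."""
    m = NOTICE_DATE.match(filename.rsplit("/", 1)[-1])
    return date(int(m[1]), int(m[2]), int(m[3])) if m else None


def extract_targets(notice_text: str) -> list:
    """List every GitHub URL named in a notice, split into owner, repo, and path."""
    targets = []
    for m in GITHUB_URL.finditer(notice_text):
        owner, repo = m.group(1), m.group(2)
        path = (m.group(3) or "").rstrip(".,;")  # drop sentence punctuation
        if not path:
            repo = repo.rstrip(".")  # URL ended a sentence
        targets.append({
            "owner": owner,
            "repo": repo,
            # No path means the notice names the whole repository, not a file.
            "whole_repo": path in ("", "/"),
            "url": f"https://github.com/{owner}/{repo}{path}",
        })
    return targets


def fetch_location(username: str, token: Optional[str] = None) -> Optional[str]:
    """Look up the free-text profile location; None when the user lists none."""
    req = urllib.request.Request(
        f"https://api.github.com/users/{username}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "dmca-notice-sketch",  # hypothetical client name
        },
    )
    if token:  # unauthenticated requests are heavily rate limited
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("location")


def tally_locations(locations: dict) -> Counter:
    """Count users per raw location string, grouping missing values together."""
    return Counter(loc or "(no location listed)" for loc in locations.values())
```

As a usage note, calling `extract_targets` on an invented notice line such as `"Please remove https://github.com/example-user/ukb-analysis/blob/main/data/pheno.csv"` yields one entry with `whole_repo` set to `False`, which is how file-level requests can be told apart from whole-repository ones. Because profile locations are free text, mapping them to countries would still need manual or heuristic cleanup, which is one reason the location counts are imperfect.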