Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs
- #copyright infringement
- #AI ethics
- #language models
- Finetuning large language models (LLMs) to expand plot summaries into full text triggers verbatim recall of copyrighted books.
- Models including GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3 reproduce 85–90% of held-out copyrighted books, including long verbatim spans.
- This extraction generalizes: finetuning on one author's novels enables recall of books from over 30 unrelated authors.
- Finetuning on real author works reactivates latent memorization from pretraining, while synthetic text yields near-zero extraction.
- Multiple models from different providers memorize the same books in similar regions, indicating an industry-wide vulnerability.
- The findings challenge legal defenses based on safety alignment measures and undermine premises of recent fair use rulings.
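The extraction rates above imply some way of scoring how much of a reference book a model reproduces verbatim. The paper's exact metric is not described here; as a minimal sketch only, the following Python function (a hypothetical helper, not from the paper) quantifies verbatim overlap as the fraction of a reference text covered by matching spans above a length threshold, using the standard library's `difflib`:

```python
from difflib import SequenceMatcher

def verbatim_coverage(reference: str, generated: str, min_span: int = 50):
    """Score verbatim recall of `reference` within `generated`.

    Returns (coverage, longest_span), where coverage is the fraction of
    reference characters that fall inside verbatim matching spans of at
    least `min_span` characters, and longest_span is the length of the
    longest such span. This is an illustrative proxy, not the metric
    used in the work summarized above.
    """
    matcher = SequenceMatcher(None, reference, generated, autojunk=False)
    # get_matching_blocks() yields non-overlapping matches in `reference`,
    # so their sizes can be summed directly.
    spans = [b for b in matcher.get_matching_blocks() if b.size >= min_span]
    covered = sum(b.size for b in spans)
    longest = max((b.size for b in spans), default=0)
    return covered / len(reference), longest
```

For example, if the generated text contains a 60-character verbatim excerpt of a 100-character reference, this returns a coverage of 0.6 with a longest span of 60. Character-level matching is a simplification; a real evaluation would more likely work over tokens or words and handle minor edits within otherwise verbatim passages.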