Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs


Finetuning large language models (LLMs) to expand plot summaries into full text triggers verbatim recall of copyrighted books.
Models like GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3 reproduce up to 85–90% of held-out copyrighted books with long verbatim spans.
This extraction generalizes: finetuning on one author's novels enables recall of books from over 30 unrelated authors.
Finetuning on real authors' works reactivates latent memorization from pretraining, while finetuning on synthetic text yields near-zero extraction.
Multiple models from different providers memorize the same books in similar regions, indicating an industry-wide vulnerability.
The findings challenge legal defenses that rely on safety-alignment measures and undermine the premises of recent fair-use rulings.
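
The summary reports that models reproduce a large share of each held-out book in long verbatim spans, but it does not say how that share is computed. As a rough illustration only, the sketch below measures the fraction of a reference text's words that fall inside word runs of a minimum length that also appear verbatim in a model's output. The function name `verbatim_coverage`, the `min_span` threshold, and the whitespace tokenization are all assumptions made for this example, not the study's actual metric.

```python
def verbatim_coverage(reference: str, generated: str, min_span: int = 5) -> float:
    """Return the fraction of reference words that lie inside a run of at
    least `min_span` consecutive words also found verbatim in `generated`.

    Hypothetical metric for illustration; tokenization is plain whitespace
    splitting, with no case or punctuation normalization.
    """
    ref = reference.split()
    gen = generated.split()
    if len(ref) < min_span:
        return 0.0

    # Every window of `min_span` consecutive words in the generated text.
    gen_ngrams = {
        tuple(gen[i:i + min_span]) for i in range(len(gen) - min_span + 1)
    }

    # Mark each reference word that belongs to a window shared with the output.
    covered = [False] * len(ref)
    for i in range(len(ref) - min_span + 1):
        if tuple(ref[i:i + min_span]) in gen_ngrams:
            for j in range(i, i + min_span):
                covered[j] = True

    return sum(covered) / len(ref)


if __name__ == "__main__":
    book = "the quick brown fox jumps over the lazy dog"
    output = "he said the quick brown fox jumps high"
    print(verbatim_coverage(book, output, min_span=3))
```

A larger `min_span` makes the measure stricter: short phrases shared by chance stop counting, and only long copied passages contribute to the score.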

Hasty Briefs (beta)