Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs
- #copyright infringement
- #AI ethics
- #language models
- Finetuning large language models (LLMs) to expand plot summaries into full text triggers verbatim recall of copyrighted books.
- Models including GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3 reproduce 85–90% of held-out copyrighted books, including long verbatim spans.
- This extraction generalizes: finetuning on one author's novels enables recall of books from over 30 unrelated authors.
- Finetuning on real author works reactivates latent memorization from pretraining, while synthetic text yields near-zero extraction.
- Multiple models from different providers memorize the same books in similar regions, indicating an industry-wide vulnerability.
- The findings challenge legal defenses based on safety alignment measures and undermine premises of recent fair use rulings.
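The extraction rates above imply some way of scoring how much of a reference book a model reproduces verbatim. The paper's exact metric is not described here; as a minimal sketch only, the following Python function (a hypothetical helper, not from the paper) quantifies verbatim overlap as the fraction of a reference text covered by matching spans above a length threshold, using the standard library's `difflib`:

```python
from difflib import SequenceMatcher

def verbatim_coverage(reference: str, generated: str, min_span: int = 50):
    """Score verbatim recall of `reference` within `generated`.

    Returns (coverage, longest_span), where coverage is the fraction of
    reference characters that fall inside verbatim matching spans of at
    least `min_span` characters, and longest_span is the length of the
    longest such span. This is an illustrative proxy, not the metric
    used in the work summarized above.
    """
    matcher = SequenceMatcher(None, reference, generated, autojunk=False)
    # get_matching_blocks() yields non-overlapping matches in `reference`,
    # so their sizes can be summed directly.
    spans = [b for b in matcher.get_matching_blocks() if b.size >= min_span]
    covered = sum(b.size for b in spans)
    longest = max((b.size for b in spans), default=0)
    return covered / len(reference), longest
```

For example, if the generated text contains a 60-character verbatim excerpt of a 100-character reference, this returns a coverage of 0.6 with a longest span of 60. Character-level matching is a simplification; a real evaluation would more likely work over tokens or words and handle minor edits within otherwise verbatim passages.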