Hasty Briefs (beta)

Finetuning Activates Verbatim Recall of Copyrighted Books in LLMs

7 hours ago
  • #copyright infringement
  • #AI ethics
  • #language models

  • Finetuning large language models (LLMs) to expand plot summaries into full text triggers verbatim recall of copyrighted books.
  • Models such as GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3 reproduce up to 85–90% of the text of held-out copyrighted books, including long verbatim spans.
  • This extraction generalizes: finetuning on one author's novels enables recall of books from over 30 unrelated authors.
  • Finetuning on authors' real works reactivates latent memorization from pretraining, while finetuning on synthetic text yields near-zero extraction.
  • Models from different providers memorize the same books, and similar regions within those books, indicating an industry-wide vulnerability.
  • The findings challenge legal defenses based on safety alignment measures and undermine premises of recent fair use rulings.
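
To make figures such as "85–90% of a book" and "long verbatim spans" more concrete, here is a minimal sketch of how verbatim overlap between a reference text and a model output could be measured. It is an illustration only, not the study's metric: the function names, the whitespace tokenizer, and the 5-word minimum span are all assumptions chosen for this example.

```python
from difflib import SequenceMatcher


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokens; a real evaluation would
    # likely normalize punctuation and formatting as well.
    return text.lower().split()


def verbatim_coverage(reference: str, generated: str, min_span: int = 5) -> float:
    """Fraction of reference words that fall inside a run of at least
    `min_span` consecutive words also present in the generated text."""
    ref, gen = tokenize(reference), tokenize(generated)
    if len(ref) < min_span:
        return 0.0
    # Every min_span-word window that occurs in the generated text.
    gen_ngrams = {tuple(gen[i:i + min_span]) for i in range(len(gen) - min_span + 1)}
    covered = [False] * len(ref)
    for i in range(len(ref) - min_span + 1):
        if tuple(ref[i:i + min_span]) in gen_ngrams:
            for j in range(i, i + min_span):
                covered[j] = True
    return sum(covered) / len(ref)


def longest_verbatim_span(reference: str, generated: str) -> int:
    """Length, in words, of the longest contiguous run shared by both texts."""
    ref, gen = tokenize(reference), tokenize(generated)
    matcher = SequenceMatcher(None, ref, gen, autojunk=False)
    match = matcher.find_longest_match(0, len(ref), 0, len(gen))
    return match.size


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog today"
    generated = "a quick brown fox jumps over the fence"
    print(verbatim_coverage(reference, generated))
    print(longest_verbatim_span(reference, generated))
```

In this toy example the two texts share the six-word run "quick brown fox jumps over the", so six of the ten reference words count as covered. A higher `min_span` would make the measure stricter by ignoring short phrases that could match by chance.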