Extracting memorized pieces of books from open-weight language models

a year ago
  • #AI
  • #memorization
  • #copyright
  • The study examines how much open-weight large language models (LLMs) memorize copyrighted books.
  • Using probabilistic extraction techniques, researchers extracted parts of the Books3 dataset from 13 LLMs.
  • Results show that memorization varies by model and book, with some models memorizing books like Harry Potter and 1984 almost entirely.
  • Even the largest LLMs tested do not memorize most books, either in whole or in part.
  • The findings have significant implications for copyright lawsuits, though they do not clearly favor either plaintiffs or defendants.
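
The summary mentions probabilistic extraction without describing it. As a rough sketch only, assuming it means scoring how likely a model is to reproduce an exact passage rather than checking a single greedy generation, the arithmetic could look like the following. The function names and the idea of feeding in per-token log-probabilities are illustrative assumptions, not the study's actual code.

```python
import math

# Hypothetical helpers: names and inputs are assumptions for illustration,
# not taken from the study.

def sequence_probability(token_logprobs):
    """Probability of sampling an exact continuation, given the
    log-probability the model assigns to each of its tokens in order."""
    return math.exp(sum(token_logprobs))

def prob_extracted_within(p_suffix, n_attempts):
    """Chance that at least one of n independent samples reproduces
    the continuation exactly."""
    return 1.0 - (1.0 - p_suffix) ** n_attempts

def attempts_needed(p_suffix, target):
    """Smallest number of independent samples for which the chance of
    at least one exact reproduction reaches `target`."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    if p_suffix <= 0.0:
        return math.inf  # the passage can never be sampled
    if p_suffix >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p_suffix))

if __name__ == "__main__":
    # Made-up log-probs for a 50-token continuation of a book excerpt.
    logprobs = [-0.1] * 50
    p = sequence_probability(logprobs)
    print(f"p(exact continuation) = {p:.6f}")
    print(f"p(extracted in 100 tries) = {prob_extracted_within(p, 100):.4f}")
    print(f"tries for 50% chance = {attempts_needed(p, 0.5)}")
```

Under this framing, a passage counts as more strongly memorized the fewer samples it takes to reproduce it, which is one way the amount of memorization could vary by model and by book.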