Extracting memorized pieces of books from open-weight language models
a year ago
- #AI
- #memorization
- #copyright
- The study examines the extent to which open-weight large language models (LLMs) memorize copyrighted books.
- Using probabilistic extraction techniques, the researchers extracted portions of books from the Books3 dataset out of 13 open-weight LLMs.
- Results show that memorization varies considerably by model and by book; some models have memorized books such as Harry Potter and 1984 almost entirely.
- Even the larger LLMs, however, do not memorize most books, either in whole or in part.
- The findings have significant implications for copyright lawsuits, though they do not clearly favor either plaintiffs or defendants.
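The core of the probabilistic extraction idea can be sketched simply: prompted with a prefix from a book, extraction counts as feasible when the model has a non-negligible chance of sampling the exact continuation across many attempts. Below is a minimal Python sketch of that arithmetic, assuming per-token probabilities are already available; the function names and the example numbers are illustrative assumptions, not the paper's code.

```python
def suffix_probability(token_probs):
    """Probability that one sample from the model reproduces the exact
    target suffix: the product of each token's conditional probability."""
    p = 1.0
    for q in token_probs:
        p *= q
    return p

def extraction_probability(p_suffix, n_samples):
    """Chance that at least one of n independent samples reproduces the
    suffix: the complement of failing n times in a row."""
    return 1.0 - (1.0 - p_suffix) ** n_samples

# Illustrative numbers (assumptions, not measured values): a 50-token
# passage where the model assigns each memorized token probability 0.9.
p = suffix_probability([0.9] * 50)
print(p)                                  # small chance per single sample
print(extraction_probability(p, 10_000))  # yet near-certain over many tries
```

The point this illustrates is why "memorized" is a probabilistic claim: even a passage that is unlikely to surface in any single generation can be extracted almost surely given enough sampling attempts.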