Extracting memorized pieces of books from open-weight language models
a year ago
- #AI
- #memorization
- #copyright
- The study examines the extent to which open-weight large language models (LLMs) memorize copyrighted books.
- Using probabilistic extraction techniques, the researchers extracted portions of books from the Books3 dataset out of 13 open-weight LLMs.
- Results show that memorization varies considerably by model and by book; some models have memorized books such as Harry Potter and 1984 almost entirely.
- Even the larger LLMs, however, do not memorize most books, either in whole or in part.
- The findings have significant implications for copyright lawsuits, though they do not clearly favor either plaintiffs or defendants.
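The core of the probabilistic extraction idea can be sketched simply: prompted with a prefix from a book, extraction counts as feasible when the model has a non-negligible chance of sampling the exact continuation across many attempts. Below is a minimal Python sketch of that arithmetic, assuming per-token probabilities are already available; the function names and the example numbers are illustrative assumptions, not the paper's code.

```python
def suffix_probability(token_probs):
    """Probability that one sample from the model reproduces the exact
    target suffix: the product of each token's conditional probability."""
    p = 1.0
    for q in token_probs:
        p *= q
    return p

def extraction_probability(p_suffix, n_samples):
    """Chance that at least one of n independent samples reproduces the
    suffix: the complement of failing n times in a row."""
    return 1.0 - (1.0 - p_suffix) ** n_samples

# Illustrative numbers (assumptions, not measured values): a 50-token
# passage where the model assigns each memorized token probability 0.9.
p = suffix_probability([0.9] * 50)
print(p)                                  # small chance per single sample
print(extraction_probability(p, 10_000))  # yet near-certain over many tries
```

The point this illustrates is why "memorized" is a probabilistic claim: even a passage that is unlikely to surface in any single generation can be extracted almost surely given enough sampling attempts.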