OpenAI's models 'memorized' copyrighted content, new study suggests

a year ago

A new study suggests OpenAI may have trained some AI models on copyrighted content without permission.
OpenAI faces lawsuits from authors and programmers alleging unauthorized use of their works for model training.
The study introduces a method for detecting 'memorized' training data in AI models using 'high-surprisal' words, meaning words that are statistically unlikely given their context.
GPT-4 and GPT-3.5 were tested and showed signs of memorizing portions of fiction books and New York Times articles.
The findings highlight the need for greater transparency and tools to audit AI training data.
OpenAI advocates for looser restrictions on using copyrighted data for AI training, despite existing lawsuits.
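
To make the term 'high-surprisal' concrete: the surprisal of a word is the negative log of its probability, so rarer words score higher. The sketch below is only an illustration under loud assumptions, not the study's method. It estimates probabilities from a toy unigram word count with add-one smoothing, whereas a real audit would presumably score words in context with a language model. The function name `high_surprisal_words` and the sample sentences are hypothetical. How the selected words are then used to probe a model is not described in this summary, so the sketch stops at the selection step.

```python
import math
from collections import Counter

def high_surprisal_words(corpus_words, text_words, k=3):
    """Return the k words of text_words with the highest surprisal (in bits).

    Assumption: probabilities come from a unigram count over corpus_words
    with add-one smoothing. This is a toy stand-in, not the study's scorer.
    """
    counts = Counter(corpus_words)
    total = len(corpus_words)
    vocab = len(counts) + 1  # one extra slot for words never seen in the corpus
    scores = {
        w: -math.log2((counts[w] + 1) / (total + vocab))
        for w in set(text_words)
    }
    # Sort by descending surprisal, breaking ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

corpus = "the cat sat on the mat the cat ran".split()
text = "the cat sat on the radar".split()
print(high_surprisal_words(corpus, text, k=1))  # → [('radar', 4.0)]
```

In this toy case 'radar' never appears in the reference corpus, so it is the least predictable word and the one a memorization probe would most plausibly target.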
