OpenAI's models 'memorized' copyrighted content, new study suggests
a year ago
- #OpenAI
- #AI Copyright
- #Training Data
- A new study suggests OpenAI may have trained some AI models on copyrighted content without permission.
- OpenAI faces lawsuits from authors and programmers alleging unauthorized use of their works for model training.
- The study introduces a method for detecting 'memorized' training data by probing AI models with 'high-surprisal' words, meaning words that are statistically unlikely in their context.
- GPT-4 and GPT-3.5 were tested and showed signs of memorizing portions of fiction books and New York Times articles.
- The findings highlight the need for greater transparency and tools to audit AI training data.
- OpenAI has advocated for looser restrictions on using copyrighted data for AI training, even as it faces these lawsuits.
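
The detection idea summarized above can be illustrated with a rough sketch. This is not the study's code. It assumes a toy unigram word-frequency model as a stand-in for the language model a real audit would use to score surprisal, and `guess_fn` is a hypothetical callable that wraps whatever model is being audited. The helper names `surprisal_table`, `mask_most_surprising`, and `memorization_rate` are likewise invented for illustration.

```python
import math
from collections import Counter


def surprisal_table(reference_text):
    """Return a function giving unigram surprisal in bits, -log2 p(word).

    Assumption: a simple add-one-smoothed word-frequency model stands in
    for the contextual language model a real audit would use.
    """
    words = reference_text.lower().split()
    counts = Counter(words)
    total = len(words)
    vocab = len(counts)

    def surprisal(word):
        # Reserve one extra slot of probability mass for unseen words.
        p = (counts.get(word.lower(), 0) + 1) / (total + vocab + 1)
        return -math.log2(p)

    return surprisal


def mask_most_surprising(snippet, surprisal, mask="[MASK]"):
    """Replace the highest-surprisal word in the snippet with a mask token."""
    words = snippet.split()
    idx = max(range(len(words)), key=lambda i: surprisal(words[i]))
    masked = words[:idx] + [mask] + words[idx + 1:]
    return " ".join(masked), words[idx]


def memorization_rate(snippets, surprisal, guess_fn):
    """Fraction of snippets whose masked word the model guesses exactly.

    guess_fn is hypothetical: it takes the masked snippet and returns the
    audited model's single-word guess.
    """
    hits = 0
    for snippet in snippets:
        masked, answer = mask_most_surprising(snippet, surprisal)
        if guess_fn(masked).strip().lower() == answer.lower():
            hits += 1
    return hits / len(snippets)


if __name__ == "__main__":
    reference = "the cat sat on the mat the dog sat on the rug"
    surprisal = surprisal_table(reference)
    masked, answer = mask_most_surprising("the cat sat on the zeppelin", surprisal)
    print(masked, "| hidden word:", answer)
```

A high hit rate on words that are hard to guess from context alone is only suggestive of memorization. A real audit would also need a baseline, for example the same test run on text the model could not have seen.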