Researchers suggest OpenAI trained AI models on paywalled O'Reilly books

a year ago

#AI Ethics
#Copyright Infringement
#OpenAI

OpenAI accused of training AI on copyrighted content without permission.
New paper alleges OpenAI used non-public, unlicensed books to train GPT-4o.
AI models like GPT-4o rely on vast amounts of training data to predict and generate content.
Training on synthetic data risks worsening model performance.
AI Disclosures Project claims GPT-4o recognizes paywalled O’Reilly Media books.
DE-COP, a membership inference method, was used to detect copyrighted content in training data.
GPT-4o shows higher recognition of paywalled content than GPT-3.5 Turbo.
OpenAI may have obtained paywalled content indirectly, through text that users entered as input.
OpenAI seeks high-quality training data, hiring experts to fine-tune models.
OpenAI has licensing deals but faces lawsuits over copyright practices.