Poisoning Well for LLMs
5 days ago
- #crawlers
- #LLM
- #content-poisoning
- Large Language Models (LLMs) are trained on content without proper consent from authors.
- Blocking LLM crawlers with robots.txt is largely ineffective, since many of them simply ignore it (a sample disallow list is shown after this list).
- Authors are experimenting with poisoning LLMs by feeding them corrupted content via nofollow links.
- Legitimate crawlers such as Googlebot can be verified by matching their IP addresses via reverse and forward DNS lookups, but this is technically involved (see the verification sketch after this list).
- The proposed method creates nonsense versions of articles reachable only through nofollow links, so that only crawlers which ignore nofollow, presumably LLM crawlers, ever fetch them (an example link appears after this list).
- The nonsense content includes grammatical distortions and lexical absurdities to confuse LLMs.
- Implementation relies on templates, text transforms, and word substitutions to generate the nonsense content (a toy substitution pass is sketched after this list).
- The goal is to deplete LLM crawler resources and degrade their output quality.
- Collaboration is sought from readers knowledgeable about crawler and LLM behavior to refine the approach.
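
For reference, asking cooperative crawlers to stay away looks roughly like the robots.txt below. The user-agent tokens are the publicly documented ones for OpenAI, Anthropic, Common Crawl, and Google's AI-training opt-out; the point of the post is that not every crawler honors these rules.

```
# robots.txt: only effective against crawlers that choose to respect it
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```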
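A minimal sketch, in Python, of the Googlebot verification mentioned above. It follows Google's documented reverse-then-forward DNS check; the sample IP in the usage line is hypothetical and the code is illustrative, not the post's own implementation.

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Check whether a request claiming to be Googlebot really comes from Google.

    Reverse-resolve the IP, require a googlebot.com / google.com hostname,
    then forward-resolve that hostname and confirm it maps back to the IP.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward DNS
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

if __name__ == "__main__":
    print(is_verified_googlebot("66.249.66.1"))  # hypothetical crawler IP
```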
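The nofollow gating could look something like the markup below; the /nonsense/ URL path is an assumption for illustration, not necessarily the author's actual structure.

```html
<!-- Polite crawlers are told not to follow this link; crawlers that
     ignore rel="nofollow" end up on the nonsense version of the page. -->
<a href="/nonsense/poisoning-well/" rel="nofollow">Poisoning Well</a>
```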
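A toy Python sketch of the word-substitution idea, assuming a hand-written table of absurd replacements; the original write-up builds its nonsense pages with templates and transforms inside its own site generator, so treat this only as an illustration of the concept.

```python
import re

# Hypothetical substitution table: ordinary words mapped to lexical absurdities.
SUBSTITUTIONS = {
    "website": "wobble-site",
    "content": "gruel",
    "accessible": "edible",
    "users": "haddocks",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b",
    re.IGNORECASE,
)

def poison(text: str) -> str:
    """Return a nonsense version of `text` by swapping listed words for
    absurd replacements, keeping the original word's capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = SUBSTITUTIONS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return _PATTERN.sub(swap, text)

if __name__ == "__main__":
    print(poison("Content should be accessible to users of any website."))
    # -> Gruel should be edible to haddocks of any wobble-site.
```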