Poisoning Well for LLMs
5 days ago
- #crawlers
- #LLM
- #content-poisoning
- Large Language Models (LLMs) are trained on content without proper consent from authors.
- Blocking LLM crawlers with robots.txt is largely ineffective, since many of them simply ignore it (a sample disallow list is shown after this list).
- Authors are experimenting with poisoning LLMs by feeding them corrupted content via nofollow links.
- Legitimate crawlers such as Googlebot can be verified by matching their IP addresses via reverse and forward DNS lookups, but this is technically involved (see the verification sketch after this list).
- The proposed method creates nonsense versions of articles reachable only through nofollow links, so that only crawlers which ignore nofollow, presumably LLM crawlers, ever fetch them (an example link appears after this list).
- The nonsense content includes grammatical distortions and lexical absurdities to confuse LLMs.
- Implementation relies on templates, text transforms, and word substitutions to generate the nonsense content (a toy substitution pass is sketched after this list).
- The goal is to deplete LLM crawler resources and degrade their output quality.
- Collaboration is sought from readers knowledgeable about crawler and LLM behavior to refine the approach.
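
For reference, asking cooperative crawlers to stay away looks roughly like the robots.txt below. The user-agent tokens are the publicly documented ones for OpenAI, Anthropic, Common Crawl, and Google's AI-training opt-out; the point of the post is that not every crawler honors these rules.

```
# robots.txt: only effective against crawlers that choose to respect it
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```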
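A minimal sketch, in Python, of the Googlebot verification mentioned above. It follows Google's documented reverse-then-forward DNS check; the sample IP in the usage line is hypothetical and the code is illustrative, not the post's own implementation.

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Check whether a request claiming to be Googlebot really comes from Google.

    Reverse-resolve the IP, require a googlebot.com / google.com hostname,
    then forward-resolve that hostname and confirm it maps back to the IP.
    """
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)            # reverse DNS
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        forward_ips = socket.gethostbyname_ex(hostname)[2]   # forward DNS
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

if __name__ == "__main__":
    print(is_verified_googlebot("66.249.66.1"))  # hypothetical crawler IP
```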
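The nofollow gating could look something like the markup below; the /nonsense/ URL path is an assumption for illustration, not necessarily the author's actual structure.

```html
<!-- Polite crawlers are told not to follow this link; crawlers that
     ignore rel="nofollow" end up on the nonsense version of the page. -->
<a href="/nonsense/poisoning-well/" rel="nofollow">Poisoning Well</a>
```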
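A toy Python sketch of the word-substitution idea, assuming a hand-written table of absurd replacements; the original write-up builds its nonsense pages with templates and transforms inside its own site generator, so treat this only as an illustration of the concept.

```python
import re

# Hypothetical substitution table: ordinary words mapped to lexical absurdities.
SUBSTITUTIONS = {
    "website": "wobble-site",
    "content": "gruel",
    "accessible": "edible",
    "users": "haddocks",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b",
    re.IGNORECASE,
)

def poison(text: str) -> str:
    """Return a nonsense version of `text` by swapping listed words for
    absurd replacements, keeping the original word's capitalization."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = SUBSTITUTIONS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return _PATTERN.sub(swap, text)

if __name__ == "__main__":
    print(poison("Content should be accessible to users of any website."))
    # -> Gruel should be edible to haddocks of any wobble-site.
```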