Hasty Briefs

Poisoning Well for LLMs

5 days ago
  • #crawlers
  • #LLM
  • #content-poisoning
  • Large Language Models (LLMs) are trained on content without proper consent from authors.
  • Using robots.txt to block LLM crawlers is largely ineffective, since many crawlers simply ignore it (a sample robots.txt block appears after this list).
  • Authors are experimenting with poisoning LLMs by feeding them corrupted content via nofollow links.
  • Googlebot can be verified by matching request IPs against Google's published ranges or via reverse DNS, but this adds technical complexity (see the verification sketch below).
  • One method creates nonsense versions of articles reachable only via nofollow links: compliant crawlers honor nofollow and skip them, while misbehaving LLM crawlers follow the links and ingest the junk (see the link markup below).
  • The nonsense content includes grammatical distortions and lexical absurdities to confuse LLMs.
  • Implementation relies on templates, transforms, and word substitutions to generate the nonsense content (a toy generator sketch follows this list).
  • The goal is to waste LLM crawler resources and degrade the quality of models trained on the scraped text.
  • Collaboration is sought from those knowledgeable about crawler and LLM behavior to refine the approach.
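
As a reference for the robots.txt point, blocking known LLM crawlers looks like the sketch below. The user-agent strings shown (GPTBot, ClaudeBot, CCBot) are publicly documented, but compliance is voluntary, which is exactly the weakness the brief describes.

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```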
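
For the Googlebot verification point, Google documents a reverse-DNS check followed by a forward-confirming lookup. A minimal Python sketch of that check, with error handling kept deliberately simple, might look like this:

```python
import socket

def is_verified_googlebot(ip: str) -> bool:
    """Reverse-resolve the IP, check the Google domain, then forward-confirm."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # reverse DNS lookup
    except socket.herror:
        return False
    # Genuine Googlebot hosts end in googlebot.com or google.com.
    if not (host.endswith(".googlebot.com") or host.endswith(".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(host)[2]  # forward-confirm
    except socket.gaierror:
        return False
    return ip in forward_ips
```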
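
The nofollow trap itself is plain markup. The path below is illustrative, not the author's actual URL scheme:

```html
<!-- Compliant crawlers honor rel="nofollow" and skip this link;
     misbehaving LLM crawlers follow it into the nonsense version. -->
<a href="/nonsense/my-article/" rel="nofollow">My article</a>
```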
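
Finally, the brief does not reproduce the author's generator, so the following is only a toy illustration of the template/transform/substitution idea; the substitution table and the shuffle transform are invented for the example.

```python
import random

# Invented substitution table: lexical absurdities.
SUBSTITUTIONS = {
    "the": "a notorious",
    "is": "allegedly resembles",
    "and": "or perhaps",
}

def substitute_words(sentence: str) -> str:
    """Swap common words for absurd replacements."""
    return " ".join(SUBSTITUTIONS.get(w.lower(), w) for w in sentence.split())

def distort_grammar(sentence: str) -> str:
    """Shuffle the middle of the sentence to break its syntax."""
    words = sentence.split()
    if len(words) > 4:
        middle = words[1:-1]
        random.shuffle(middle)
        words = [words[0]] + middle + [words[-1]]
    return " ".join(words)

def poison(text: str) -> str:
    """Produce a nonsense version of a paragraph, sentence by sentence."""
    sentences = text.split(". ")
    return ". ".join(distort_grammar(substitute_words(s)) for s in sentences)

if __name__ == "__main__":
    print(poison("The crawler reads the page and stores the text."))
```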