A small number of samples can poison LLMs of any size

5 hours ago

https://www.anthropic.com/research/small-samples-poison

Copy Link

#AI Security
#Data Poisoning
#LLM Vulnerabilities

A joint study found that as few as 250 malicious documents can create a 'backdoor' vulnerability in large language models (LLMs), regardless of model size or training data volume.
The study challenges the assumption that attackers need to control a percentage of training data, showing instead that a small, fixed number of poisoned documents can be effective.
Backdoors in LLMs can be triggered by specific phrases, causing the model to exhibit undesirable behaviors, such as producing gibberish text or exfiltrating sensitive data.
The research is the largest poisoning investigation to date, revealing that poisoning attacks require a near-constant number of documents across different model sizes.
Creating 250 malicious documents is significantly easier than creating millions, making poisoning attacks more feasible for potential attackers.
The study tested a 'denial-of-service' attack where models were trained to produce gibberish text upon encountering a specific trigger phrase, demonstrating clear and measurable success.
Results showed that model size does not affect poisoning success; larger models trained on more data were just as vulnerable as smaller ones when exposed to the same number of poisoned documents.
The findings suggest that absolute count, not relative proportion, of poisoned documents is key to attack effectiveness, with 250 documents being sufficient to backdoor models.
The study raises concerns about the practicality of poisoning attacks and encourages further research into understanding and mitigating these vulnerabilities.
Despite the risks of publicizing these findings, the benefits of raising awareness and motivating defensive measures were deemed to outweigh potential misuse by attackers.

Hasty Briefsbeta

A small number of samples can poison LLMs of any size