PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free
- #Over-Defense Mitigation
- #Prompt Injection Defense
- #LLM Security
- Prompt injection attacks threaten LLMs by enabling goal hijacking and data leakage.
- Prompt guard models suffer from over-defense: they falsely flag benign inputs simply because those inputs contain trigger words. The paper tackles this with the NotInject dataset and the PIGuard model.
- NotInject is an over-defense benchmark of 339 benign samples enriched with trigger words common in prompt injection attacks; on it, the accuracy of existing guard models drops to roughly 60%, near random-guessing levels (see the evaluation sketch after this list).
- PIGuard is trained with the Mitigating Over-defense for Free (MOF) strategy, which reduces the model's bias toward trigger words and improves benchmark performance by 30.8% over the best existing model (a data-side sketch of the idea follows below).
- PIGuard is a lightweight (184 MB), open-source model that performs competitively against much larger models such as GPT-4.
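
To make the over-defense failure mode concrete, here is a minimal sketch of the kind of check NotInject enables: run a prompt-guard classifier on benign prompts that deliberately contain injection trigger words and count false positives. The checkpoint id, label scheme, and sample prompts are illustrative assumptions, not taken from the paper.

```python
# Minimal over-defense check: classify benign inputs that contain
# injection trigger words and count how many get falsely flagged.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="leolee99/InjecGuard",  # assumed checkpoint id, for illustration
)

# Benign prompts that happen to contain common trigger words
# ("ignore", "system prompt", "instructions"), in the spirit of NotInject.
benign_prompts = [
    "Please ignore the typos in my draft and summarize the main argument.",
    "What does the phrase 'system prompt' mean in LLM engineering?",
    "The instructions manual is missing page 3; can you infer the steps?",
]

false_positives = 0
for prompt in benign_prompts:
    result = classifier(prompt)[0]
    flagged = result["label"].lower() != "benign"  # assumed label scheme
    false_positives += flagged
    print(f"{result['label']:>10} ({result['score']:.2f})  {prompt}")

print(f"Over-defense rate on this toy set: {false_positives}/{len(benign_prompts)}")
```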
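
The summary describes MOF only at a high level. The following is a hedged sketch of the data-side intuition, assuming MOF balances the trigger-word distribution by adding benign training samples that contain trigger words, so the classifier cannot use word presence alone as a shortcut. The word list and templates here are invented for illustration.

```python
import random

# Assumed: words over-represented in injection-labeled training data,
# which a biased guard model may learn as shortcut features.
TRIGGER_WORDS = ["ignore", "override", "system prompt", "instructions"]

# Invented benign templates that embed a trigger word in an innocuous context.
BENIGN_TEMPLATES = [
    "Can you explain what '{w}' means in everyday English?",
    "My draft uses the word '{w}' too often; suggest some synonyms.",
    "The section titled '{w}' in this manual is unclear; summarize it.",
]

def make_benign_counterexamples(n: int, seed: int = 0) -> list[dict]:
    """Generate benign (label 0) samples containing trigger words, so the
    trigger-word distribution is balanced across both labels."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        word = rng.choice(TRIGGER_WORDS)
        text = rng.choice(BENIGN_TEMPLATES).format(w=word)
        samples.append({"text": text, "label": 0})  # 0 = benign
    return samples

# Mixing such samples into the detector's training set would discourage it
# from flagging inputs merely because a trigger word appears.
print(make_benign_counterexamples(3))
```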