Hasty Briefs (beta)


PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free

7 hours ago
  • #Over-Defense Mitigation
  • #Prompt Injection Defense
  • #LLM Security
  • Prompt injection attacks threaten LLMs by enabling goal hijacking and data leakage.
  • Prompt guard models suffer from over-defense, falsely flagging benign inputs because of trigger-word bias; the NotInject dataset and the PIGuard model address this.
  • NotInject is a dataset of 339 benign samples that contain trigger words, used to evaluate over-defense; on it, existing models' accuracy drops to near 60%.
  • PIGuard incorporates the MOF (Mitigating Overdefense for Free) training strategy to reduce trigger-word bias, improving performance by 30.8% over existing models on benchmarks.
  • PIGuard is a lightweight, open-source model with about 184M parameters, achieving performance competitive with models like GPT-4.
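The over-defense the bullets describe is essentially a false-positive rate: the fraction of known-benign, trigger-word-laden prompts that a guard model still flags as injections. A minimal sketch of that metric, with illustrative names not taken from the paper:

```python
# Hypothetical sketch of an over-defense metric: the share of benign
# trigger-word prompts that a guard model wrongly labels as injections.
# Names are illustrative, not from the PIGuard paper.

def over_defense_rate(benign_predictions):
    """benign_predictions: labels a guard model assigned to prompts
    known to be benign, each either 'injection' or 'benign'."""
    if not benign_predictions:
        return 0.0
    false_positives = sum(
        1 for label in benign_predictions if label == "injection"
    )
    return false_positives / len(benign_predictions)

# Example: a model that flags 2 of 5 benign trigger-word prompts
preds = ["benign", "injection", "benign", "injection", "benign"]
rate = over_defense_rate(preds)  # 0.4
```

A NotInject-style evaluation would run each of the 339 benign samples through the guard model and report this rate (or its complement as accuracy); the near-60% accuracy cited above corresponds to an over-defense rate around 40% on such inputs.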