Hasty Briefs
Refusal in Language Models Is Mediated by a Single Direction

  • #language-models
  • #model-safety
  • #refusal-behavior
  • Safety fine-tuning trains language models to refuse harmful instructions.
  • Refusal is controlled by a one-dimensional subspace across multiple open-source models.
  • Modifying this direction can prevent refusal or induce it in harmless requests.
  • A white-box jailbreak method can disable refusal with minimal impact on other capabilities.
  • Adversarial suffixes suppress the refusal direction, revealing the brittleness of current safety mechanisms.
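
The "single direction" finding above can be sketched with simple linear algebra: to disable refusal, project the refusal direction out of the model's activations; to induce refusal, add the direction back in. The snippet below is a minimal NumPy illustration of this directional-ablation idea, not the paper's actual implementation; the function names and toy vectors are hypothetical.

```python
import numpy as np

def ablate_direction(activations, direction):
    """Remove each activation's component along `direction`
    (directional ablation): x - (x . d_hat) d_hat."""
    d = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ d, d)

def add_direction(activations, direction, scale=1.0):
    """Add the direction to activations to induce the mediated
    behavior (e.g. make harmless requests trigger refusal)."""
    return activations + scale * direction

# Toy example: three 2-D "activations" and a hypothetical refusal direction.
acts = np.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]])
refusal_dir = np.array([1.0, 0.0])

ablated = ablate_direction(acts, refusal_dir)
# After ablation, no component remains along the refusal direction.
assert np.allclose(ablated @ refusal_dir, 0.0)
```

In a real model, the same projection would be applied to residual-stream activations at each layer during the forward pass, which is why the intervention is so cheap relative to retraining.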