Refusal in Language Models Is Mediated by a Single Direction
- #language-models
- #model-safety
- #refusal-behavior
- Safety fine-tuning trains chat language models to refuse harmful instructions.
- Across a range of open-source chat models, refusal is mediated by a single direction, a one-dimensional subspace of the residual-stream activations.
- Ablating this direction from the activations bypasses refusal on harmful prompts, while adding it in induces refusal on harmless ones (see the sketches after this list).
- This yields a white-box jailbreak: a simple rank-one weight edit disables refusal with minimal impact on other capabilities.
- Adversarial suffixes work in part by suppressing propagation of the refusal direction, underscoring how brittle current safety fine-tuning is.
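A minimal sketch of the two activation interventions, assuming a `refusal_dir` vector has already been extracted (the paper computes it as the difference in mean residual-stream activations between harmful and harmless prompts). The function names and shapes here are illustrative, not the paper's actual code:

```python
import torch

def ablate_refusal(x: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Directional ablation: remove the refusal component from activations.

    x:           (..., d_model) residual-stream activations
    refusal_dir: (d_model,) refusal direction (need not be unit norm)
    """
    r_hat = refusal_dir / refusal_dir.norm()
    # Project out the component along r_hat: x' = x - (x . r_hat) r_hat
    return x - (x @ r_hat).unsqueeze(-1) * r_hat

def induce_refusal(x: torch.Tensor, refusal_dir: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """Activation addition: push activations along the refusal direction."""
    r_hat = refusal_dir / refusal_dir.norm()
    return x + alpha * r_hat
```

Hooked into the forward pass, the first intervention suppresses refusal on harmful prompts; the second, with a scale `alpha` chosen from harmful-prompt activations, makes the model refuse harmless ones.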
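The rank-one weight edit mentioned above can be sketched as weight orthogonalization, under the assumption that `W` is a matrix writing into the residual stream (rows indexed by `d_model`):

```python
import torch

def orthogonalize(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Rank-one edit: make W's output orthogonal to the refusal direction.

    W:           (d_model, d_in) weight matrix writing into the residual stream
    refusal_dir: (d_model,) refusal direction
    """
    r_hat = refusal_dir / refusal_dir.norm()
    # W' = W - r_hat r_hat^T W, so W' x has no component along r_hat for any x
    return W - torch.outer(r_hat, r_hat) @ W
```

Applying this edit to every matrix that writes into the residual stream (embedding, attention output, MLP output) bakes the ablation into the weights, so no inference-time hook is needed.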