Refusal in Language Models Is Mediated by a Single Direction
- #language-models
- #model-safety
- #refusal-behavior
- Safety fine-tuning trains chat language models to refuse harmful instructions.
- Across a range of open-source chat models, refusal is mediated by a single direction, a one-dimensional subspace of the residual-stream activations.
- Ablating this direction from the activations bypasses refusal on harmful prompts, while adding it in induces refusal on harmless ones (see the sketches after this list).
- This yields a white-box jailbreak: a simple rank-one weight edit disables refusal with minimal impact on other capabilities.
- Adversarial suffixes work in part by suppressing propagation of the refusal direction, underscoring how brittle current safety fine-tuning is.
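A minimal sketch of the two activation interventions, assuming a `refusal_dir` vector has already been extracted (the paper computes it as the difference in mean residual-stream activations between harmful and harmless prompts). The function names and shapes here are illustrative, not the paper's actual code:

```python
import torch

def ablate_refusal(x: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Directional ablation: remove the refusal component from activations.

    x:           (..., d_model) residual-stream activations
    refusal_dir: (d_model,) refusal direction (need not be unit norm)
    """
    r_hat = refusal_dir / refusal_dir.norm()
    # Project out the component along r_hat: x' = x - (x . r_hat) r_hat
    return x - (x @ r_hat).unsqueeze(-1) * r_hat

def induce_refusal(x: torch.Tensor, refusal_dir: torch.Tensor,
                   alpha: float = 1.0) -> torch.Tensor:
    """Activation addition: push activations along the refusal direction."""
    r_hat = refusal_dir / refusal_dir.norm()
    return x + alpha * r_hat
```

Hooked into the forward pass, the first intervention suppresses refusal on harmful prompts; the second, with a scale `alpha` chosen from harmful-prompt activations, makes the model refuse harmless ones.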
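The rank-one weight edit mentioned above can be sketched as weight orthogonalization, under the assumption that `W` is a matrix writing into the residual stream (rows indexed by `d_model`):

```python
import torch

def orthogonalize(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Rank-one edit: make W's output orthogonal to the refusal direction.

    W:           (d_model, d_in) weight matrix writing into the residual stream
    refusal_dir: (d_model,) refusal direction
    """
    r_hat = refusal_dir / refusal_dir.norm()
    # W' = W - r_hat r_hat^T W, so W' x has no component along r_hat for any x
    return W - torch.outer(r_hat, r_hat) @ W
```

Applying this edit to every matrix that writes into the residual stream (embedding, attention output, MLP output) bakes the ablation into the weights, so no inference-time hook is needed.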