Heretic: Automatic censorship removal for language models
- #censorship-removal
- #language-models
- #machine-learning
- Heretic is a tool for removing censorship from transformer-based language models without expensive post-training.
- It combines directional ablation (abliteration) with a TPE-based (Tree-structured Parzen Estimator) parameter optimizer powered by Optuna; a sketch of such an optimization loop appears after this list.
- Heretic runs fully automatically, jointly minimizing the number of refusals and the KL divergence from the original model, so the decensored model stays close to the original's behavior.
- It supports most dense models, including multimodal models and some MoE architectures, but not SSMs/hybrid models.
- Heretic benchmarks the host system to choose an optimal batch size; decensoring a model like Llama-3.1-8B takes about 45 minutes on an RTX 3090.
- After decensoring, users can save the model locally, upload it to Hugging Face, or chat with it to test the results (a generic save/upload sketch appears after this list).
- Heretic implements a parametrized variant of directional ablation, orthogonalizing the model's weight matrices with respect to refusal directions (see the orthogonalization sketch after this list).
- Key innovations include flexible ablation-weight kernels that vary ablation strength across layers, floating-point refusal-direction indices that allow interpolating between directions from adjacent layers, and separate optimizable parameters for each model component (see the kernel/interpolation sketch after this list).
- Heretic was developed from scratch, informed by various abliteration techniques and articles.
- The tool is released under the GNU Affero General Public License.
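
To make the orthogonalization step concrete, here is a minimal sketch of directional ablation on a single weight matrix, assuming PyTorch and a precomputed refusal direction; the function name and the scalar `weight` parameter are illustrative, not Heretic's actual API.

```python
import torch

def ablate_direction(W: torch.Tensor, r: torch.Tensor, weight: float = 1.0) -> torch.Tensor:
    """Orthogonalize weight matrix W with respect to refusal direction r.

    W: (d_model, d_in) matrix that writes into the residual stream.
    r: (d_model,) refusal direction.
    weight: ablation strength; 1.0 removes the r-component entirely.
    """
    r = r / r.norm()                    # unit-normalize the direction
    projection = torch.outer(r, r) @ W  # component of W's output along r
    return W - weight * projection      # W' = (I - weight * r r^T) W
```

With `weight=1.0` applied to every matrix that writes into the residual stream (e.g. attention output projections and MLP down projections), this reduces to plain abliteration; Heretic's parametrized variant instead lets the optimizer vary the strength per layer and per component.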
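The kernel and float-index ideas can be sketched in the same vein. Everything below is an assumption-laden illustration: the Gaussian kernel shape, the parameter names (`max_weight`, `center`, `width`), and the linear interpolation scheme are stand-ins for whatever Heretic actually uses.

```python
import math
import torch

def ablation_weight(layer: int, n_layers: int,
                    max_weight: float, center: float, width: float) -> float:
    """Hypothetical ablation-weight kernel: strength peaks at `center`
    (a relative depth in [0, 1]) and decays with distance from it.
    A Gaussian shape is used here purely for illustration."""
    x = (layer / (n_layers - 1) - center) / width
    return max_weight * math.exp(-0.5 * x * x)

def refusal_direction(directions: torch.Tensor, index: float) -> torch.Tensor:
    """Interpolate per-layer refusal directions at a floating-point index.

    directions: (n_layers, d_model), one refusal direction per layer.
    index: float in [0, n_layers - 1]; fractional values blend the two
    adjacent layers' directions linearly.
    """
    lo = int(index)
    hi = min(lo + 1, directions.shape[0] - 1)
    frac = index - lo
    r = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return r / r.norm()
```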
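The optimization loop itself might look roughly like the following Optuna sketch. The parameter names, search ranges, synthetic `evaluate` function, and the way the two objectives are scalarized are all hypothetical; only the TPE-sampler-driven study structure is taken from the description above.

```python
import optuna

def evaluate(max_weight: float, center: float, direction_index: float) -> tuple[float, float]:
    """Stand-in for the real evaluation: apply ablation with these parameters,
    count refusals on a set of unsafe prompts, and measure KL divergence from
    the original model on harmless prompts. A synthetic surface is returned
    here so the sketch runs end to end."""
    refusals = abs(max_weight - 1.0) * 20.0 + abs(center - 0.6) * 10.0
    kl = max_weight * 0.05 + abs(direction_index - 20.0) * 0.001
    return refusals, kl

def objective(trial: optuna.Trial) -> float:
    # Hypothetical parameter names and ranges; Heretic's actual search space differs.
    max_weight = trial.suggest_float("max_weight", 0.0, 1.5)
    center = trial.suggest_float("center", 0.0, 1.0)
    direction_index = trial.suggest_float("direction_index", 0.0, 31.0)
    refusals, kl = evaluate(max_weight, center, direction_index)
    # Scalarize the two goals; this particular weighting is an arbitrary illustration.
    return refusals + 10.0 * kl

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100)
print(study.best_params)
```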
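Finally, the save/upload step corresponds to standard Hugging Face `transformers` calls, independent of Heretic's own interface; the paths and repository name below are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the decensored weights (placeholder path; in practice this is whatever
# the optimization run produced).
model = AutoModelForCausalLM.from_pretrained("./decensored-model")
tokenizer = AutoTokenizer.from_pretrained("./decensored-model")

# Save a copy locally.
model.save_pretrained("./decensored-model-final")
tokenizer.save_pretrained("./decensored-model-final")

# Upload to the Hugging Face Hub (requires authentication, e.g. via
# `huggingface-cli login`).
model.push_to_hub("your-username/decensored-model")
tokenizer.push_to_hub("your-username/decensored-model")
```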