Heretic: Automatic censorship removal for language models
- #censorship-removal
- #language-models
- #machine-learning
- Heretic is a tool for removing censorship from transformer-based language models without expensive post-training.
- It combines directional ablation (abliteration) with a TPE-based (Tree-structured Parzen Estimator) parameter optimizer powered by Optuna; a sketch of such an optimization loop appears after this list.
- Heretic runs fully automatically, jointly minimizing the number of refusals and the KL divergence from the original model, so the decensored model stays close to the original's behavior.
- It supports most dense models, including multimodal models and some MoE architectures, but not SSMs/hybrid models.
- Heretic benchmarks the host system to choose an optimal batch size; decensoring a model like Llama-3.1-8B takes about 45 minutes on an RTX 3090.
- After decensoring, users can save the model locally, upload it to Hugging Face, or chat with it to test the results (a generic save/upload sketch appears after this list).
- Heretic implements a parametrized variant of directional ablation, orthogonalizing the model's weight matrices with respect to refusal directions (see the orthogonalization sketch after this list).
- Key innovations include flexible ablation-weight kernels that vary ablation strength across layers, floating-point refusal-direction indices that allow interpolating between directions from adjacent layers, and separate optimizable parameters for each model component (see the kernel/interpolation sketch after this list).
- Heretic was developed from scratch, informed by various abliteration techniques and articles.
- The tool is released under the GNU Affero General Public License.
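
To make the orthogonalization step concrete, here is a minimal sketch of directional ablation on a single weight matrix, assuming PyTorch and a precomputed refusal direction; the function name and the scalar `weight` parameter are illustrative, not Heretic's actual API.

```python
import torch

def ablate_direction(W: torch.Tensor, r: torch.Tensor, weight: float = 1.0) -> torch.Tensor:
    """Orthogonalize weight matrix W with respect to refusal direction r.

    W: (d_model, d_in) matrix that writes into the residual stream.
    r: (d_model,) refusal direction.
    weight: ablation strength; 1.0 removes the r-component entirely.
    """
    r = r / r.norm()                    # unit-normalize the direction
    projection = torch.outer(r, r) @ W  # component of W's output along r
    return W - weight * projection      # W' = (I - weight * r r^T) W
```

With `weight=1.0` applied to every matrix that writes into the residual stream (e.g. attention output projections and MLP down projections), this reduces to plain abliteration; Heretic's parametrized variant instead lets the optimizer vary the strength per layer and per component.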
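The kernel and float-index ideas can be sketched in the same vein. Everything below is an assumption-laden illustration: the Gaussian kernel shape, the parameter names (`max_weight`, `center`, `width`), and the linear interpolation scheme are stand-ins for whatever Heretic actually uses.

```python
import math
import torch

def ablation_weight(layer: int, n_layers: int,
                    max_weight: float, center: float, width: float) -> float:
    """Hypothetical ablation-weight kernel: strength peaks at `center`
    (a relative depth in [0, 1]) and decays with distance from it.
    A Gaussian shape is used here purely for illustration."""
    x = (layer / (n_layers - 1) - center) / width
    return max_weight * math.exp(-0.5 * x * x)

def refusal_direction(directions: torch.Tensor, index: float) -> torch.Tensor:
    """Interpolate per-layer refusal directions at a floating-point index.

    directions: (n_layers, d_model), one refusal direction per layer.
    index: float in [0, n_layers - 1]; fractional values blend the two
    adjacent layers' directions linearly.
    """
    lo = int(index)
    hi = min(lo + 1, directions.shape[0] - 1)
    frac = index - lo
    r = (1.0 - frac) * directions[lo] + frac * directions[hi]
    return r / r.norm()
```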
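The optimization loop itself might look roughly like the following Optuna sketch. The parameter names, search ranges, synthetic `evaluate` function, and the way the two objectives are scalarized are all hypothetical; only the TPE-sampler-driven study structure is taken from the description above.

```python
import optuna

def evaluate(max_weight: float, center: float, direction_index: float) -> tuple[float, float]:
    """Stand-in for the real evaluation: apply ablation with these parameters,
    count refusals on a set of unsafe prompts, and measure KL divergence from
    the original model on harmless prompts. A synthetic surface is returned
    here so the sketch runs end to end."""
    refusals = abs(max_weight - 1.0) * 20.0 + abs(center - 0.6) * 10.0
    kl = max_weight * 0.05 + abs(direction_index - 20.0) * 0.001
    return refusals, kl

def objective(trial: optuna.Trial) -> float:
    # Hypothetical parameter names and ranges; Heretic's actual search space differs.
    max_weight = trial.suggest_float("max_weight", 0.0, 1.5)
    center = trial.suggest_float("center", 0.0, 1.0)
    direction_index = trial.suggest_float("direction_index", 0.0, 31.0)
    refusals, kl = evaluate(max_weight, center, direction_index)
    # Scalarize the two goals; this particular weighting is an arbitrary illustration.
    return refusals + 10.0 * kl

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100)
print(study.best_params)
```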
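Finally, the save/upload step corresponds to standard Hugging Face `transformers` calls, independent of Heretic's own interface; the paths and repository name below are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the decensored weights (placeholder path; in practice this is whatever
# the optimization run produced).
model = AutoModelForCausalLM.from_pretrained("./decensored-model")
tokenizer = AutoTokenizer.from_pretrained("./decensored-model")

# Save a copy locally.
model.save_pretrained("./decensored-model-final")
tokenizer.save_pretrained("./decensored-model-final")

# Upload to the Hugging Face Hub (requires authentication, e.g. via
# `huggingface-cli login`).
model.push_to_hub("your-username/decensored-model")
tokenizer.push_to_hub("your-username/decensored-model")
```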