Hasty Briefs

Heretic: Automatic censorship removal for language models

6 days ago
  • #censorship-removal
  • #language-models
  • #machine-learning
  • Heretic is a tool for removing censorship from transformer-based language models without expensive post-training.
  • It uses directional ablation (abliteration) combined with a TPE-based parameter optimizer powered by Optuna.
  • Heretic works automatically by minimizing refusals and KL divergence from the original model.
  • It supports most dense models, including multimodal models and some MoE architectures, but not SSMs/hybrid models.
  • Heretic benchmarks the host system to choose an optimal batch size, and can decensor a model like Llama-3.1-8B in about 45 minutes on an RTX 3090.
  • After decensoring, users can save the model, upload it to Hugging Face, or chat with it for testing.
  • Heretic implements a parametrized variant of directional ablation, orthogonalizing the model's weight matrices with respect to refusal directions.
  • Key innovations include flexible ablation weight kernels, float-valued refusal direction indices that interpolate between adjacent layers, and separate ablation parameters for each model component.
  • Heretic was developed from scratch, informed by various abliteration techniques and articles.
  • The tool is released under the GNU Affero General Public License.
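The core operation behind abliteration, as summarized above, can be sketched in a few lines: orthogonalize a weight matrix with respect to a refusal direction so the layer can no longer write along that direction, scaled by an ablation weight. This is a minimal NumPy illustration of the general technique; the function name, the toy dimensions, and the `alpha` parameter are illustrative, not Heretic's actual code.

```python
import numpy as np

def ablate(W, r, alpha=1.0):
    """Orthogonalize W against refusal direction r: W' = (I - alpha * r r^T) W.

    With alpha=1 and r normalized, outputs W' @ x have no component along r.
    alpha stands in for a per-component ablation weight (illustrative).
    """
    r = r / np.linalg.norm(r)
    return W - alpha * np.outer(r, r) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))   # toy weight matrix
r = rng.normal(size=8)        # toy refusal direction
x = rng.normal(size=8)        # toy residual-stream input

W_abl = ablate(W, r)
r_hat = r / np.linalg.norm(r)
# Component of the ablated layer's output along the refusal direction:
print(abs(r_hat @ (W_abl @ x)))
```

With `alpha=1` the projection removes the refusal component exactly; intermediate values attenuate it, which is what makes the variant parametrized and therefore optimizable.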
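The second quantity the optimizer minimizes, KL divergence from the original model, measures how far the decensored model's next-token distribution has drifted. A minimal sketch of that metric, assuming the comparison is done on softmaxed logits (the helper names and toy logits are illustrative):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p_logits, q_logits):
    """KL(P || Q) between two next-token distributions given their logits."""
    p = softmax(p_logits)
    q = softmax(q_logits)
    return float(np.sum(p * np.log(p / q)))

original_logits = np.array([2.0, 1.0, 0.5])   # toy values
ablated_logits = np.array([1.8, 1.1, 0.6])
print(kl_divergence(original_logits, ablated_logits))
```

A KL of zero means the ablated model is indistinguishable from the original on that input; the optimizer trades this off against the refusal count.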
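Tying the two together, the TPE-based search over ablation parameters can be sketched with Optuna's standard API. This is a hypothetical, scalarized stand-in: the parameter names, ranges, and the toy objective (a placeholder for "apply ablation, count refusals, measure KL") are assumptions for illustration, not Heretic's actual study definition.

```python
import optuna

def objective(trial):
    # Hypothetical ablation parameters (names and ranges are illustrative):
    weight = trial.suggest_float("ablation_weight", 0.0, 2.0)
    direction_index = trial.suggest_float("direction_index", 0.0, 31.0)
    # Placeholder cost standing in for: refusal count + KL divergence
    # after ablating the model with these parameters.
    refusal_proxy = (weight - 1.0) ** 2
    kl_proxy = 0.1 * weight + 0.01 * direction_index
    return refusal_proxy + kl_proxy

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=0),
)
study.optimize(objective, n_trials=30)
print(study.best_params)
```

TPE (Tree-structured Parzen Estimator) models which parameter regions produced good trials and samples new candidates accordingly, which is what lets the process run without manual tuning.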