Hasty Briefs (beta)

Even 'uncensored' models can't say what they want

7 hours ago
  • #Censorship
  • #AI Safety
  • #Language Models
  • The 'flinch' measures how much a language model suppresses the probability of charged words in contexts where they are appropriate, even when it does not outright refuse.
  • Even models marketed as 'uncensored' exhibit flinching, and refusal ablation (a common 'uncensoring' technique) can slightly increase it rather than eliminate it.
  • An evaluation of 7 pretrained models from 5 labs across 6 categories of charged words (e.g., political terms, slurs, sexual/violent terms) shows all models flinch to some degree.
  • Pythia-12B, trained on the unfiltered Pile dataset, showed the least flinch (total score 176), setting a baseline for 'open-data' models.
  • Commercial pretrains like Gemma-2-9B flinched more (total score 346.5), especially on slurs and sexual content, likely due to corpus filtering.
  • Newer models like Gemma-4-31B showed less flinch than their predecessors, suggesting training approaches have shifted over time.
  • The shape of a model's flinch profile (which word categories it avoids) is consistent across base and ablated versions, suggesting pretraining data is a key influence.
  • The findings imply that language models can subtly shape user-generated content by probabilistically discouraging certain terms, even without explicit safety filters.
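The flinch metric described above can be sketched as a simple probability comparison between two models on the same prompt. A minimal illustration, assuming per-word next-token log-probabilities are already available; the article's exact scoring formula is not given here, so this scoring rule is hypothetical:

```python
import math

def flinch_score(ref_logprobs, test_logprobs, charged_words):
    """Total probability mass the test model sheds on charged words
    relative to a reference model, in the same contexts.
    Illustrative scoring rule; the article's actual formula may differ."""
    score = 0.0
    for word in charged_words:
        p_ref = math.exp(ref_logprobs[word])
        p_test = math.exp(test_logprobs[word])
        score += max(0.0, p_ref - p_test)  # count only suppression, not boosts
    return score

# Hypothetical next-token log-probs for one prompt where the word fits:
ref = {"protest": math.log(0.40)}
test = {"protest": math.log(0.10)}
print(flinch_score(ref, test, ["protest"]))  # ~0.30: the model 'flinches'
```

Summing such per-word drops across prompts and categories would yield total scores like those reported for Pythia-12B and Gemma-2-9B, though the article's aggregation details may differ.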
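Refusal ablation, the 'uncensoring' technique the summary mentions, typically works by projecting an estimated "refusal direction" out of the model's hidden states. A minimal NumPy sketch of that projection step, assuming the direction has already been estimated from contrastive prompts (that estimation is omitted here):

```python
import numpy as np

def ablate_direction(hidden, direction):
    """Remove the component of each hidden-state vector that lies along
    `direction` (e.g. an estimated refusal direction).
    hidden: (n_tokens, d_model); direction: (d_model,)."""
    d = direction / np.linalg.norm(direction)        # unit vector
    return hidden - np.outer(hidden @ d, d)          # subtract projection
```

After ablation every hidden state is orthogonal to the direction, which suppresses outright refusals; as the summary notes, though, this does not remove the probabilistic flinch baked in by pretraining, and can even slightly increase it.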