Even 'uncensored' models can't say what they want
- #Censorship
- #AI Safety
- #Language Models
- The 'flinch' measures how much a language model suppresses the probability of charged words in contexts where they are clearly appropriate, without issuing an outright refusal (a minimal probing sketch follows this list).
- Even models marketed as 'uncensored' exhibit flinching, and refusal ablation (a common 'uncensoring' technique) can slightly increase it rather than eliminate it.
- An evaluation of 7 pretrained models from 5 labs across 6 categories of charged words (e.g., political terms, slurs, sexual/violent terms) shows all models flinch to some degree.
- Pythia-12B, trained on the unfiltered Pile dataset, showed the least flinch (total score 176), setting a baseline for 'open-data' models.
- Commercial pretrains like Gemma-2-9B flinched more (total score 346.5), especially on slurs and sexual content, likely due to corpus filtering.
- Newer models like Gemma-4-31B showed reduced flinch compared to their predecessors, suggesting that data-filtering choices shift between training generations.
- The shape of a model's flinch profile (which word categories it avoids) stays consistent between base and ablated versions, pointing to pretraining data as the key influence (see the profile-comparison sketch below).
- The findings imply that language models can subtly shape user-generated content by probabilistically discouraging certain terms, even without explicit safety filters.
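
A minimal sketch of a flinch-style probe, assuming a HuggingFace causal LM. The model name, the quote context, and the word pair below are illustrative stand-ins, not the article's actual evaluation setup or scoring formula; the idea is simply to compare the log-probability a model assigns to a charged word against a tame alternative in a context where the charged word is the natural choice.

```python
# Flinch-style probe sketch: all specifics here are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-1.4b"  # small stand-in; the study evaluated larger models
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to(device).eval()

def word_logprob(context: str, word: str) -> float:
    """Total log-probability the model assigns to `word` continuing `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids.to(device)
    word_ids = tok(" " + word, add_special_tokens=False).input_ids
    ids = torch.cat([ctx_ids, torch.tensor([word_ids], device=device)], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)  # logp[i] predicts token i+1
    start = ctx_ids.shape[1] - 1                      # first target-word position
    return sum(logp[start + i, t].item() for i, t in enumerate(word_ids))

# A context where the charged word is unambiguously appropriate: a direct quote.
context = 'Rhett Butler\'s famous last line is: "Frankly, my dear, I don\'t give a'
charged, tame = "damn", "hoot"

# An unflinching model should strongly prefer the charged word here, so the gap
# should be very negative; a shrinking (or positive) gap signals flinch.
gap = word_logprob(context, tame) - word_logprob(context, charged)
print(f"flinch gap (nats): {gap:+.3f}")
```

Averaging such gaps over many context/word pairs within a category would yield a per-category score; how the article aggregates its totals (e.g., the 176 vs 346.5 figures) is not specified here, so this is only one plausible operationalization.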
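
The "same shape across base and ablated versions" claim can be made concrete with a rank correlation between per-category flinch scores. The numbers below are invented, and two of the category names ("religious", "medical") are placeholders; the article names only political terms, slurs, and sexual/violent terms among its six categories.

```python
# Profile-shape comparison sketch: scores and two category names are made up.
from scipy.stats import spearmanr

categories = ["political", "slurs", "sexual", "violent", "religious", "medical"]
base    = {"political": 22, "slurs": 90, "sexual": 70, "violent": 35, "religious": 18, "medical": 11}
ablated = {"political": 25, "slurs": 95, "sexual": 74, "violent": 33, "religious": 20, "medical": 13}

rho, _ = spearmanr([base[c] for c in categories], [ablated[c] for c in categories])
print(f"profile rank correlation: {rho:.2f}")  # near 1.0 => same flinch 'shape'
```

A correlation near 1.0 across base/ablated pairs would support the article's inference that the profile is fixed upstream of the ablation, i.e., by pretraining data rather than by the safety tuning that ablation removes.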