Even 'uncensored' models can't say what they want
- #Censorship
- #AI Safety
- #Language Models
- The 'flinch' measures how much a language model suppresses the probability of charged words in contexts where they are clearly appropriate, without issuing an outright refusal (a minimal probing sketch follows this list).
- Even models marketed as 'uncensored' exhibit flinching, and refusal ablation (a common 'uncensoring' technique) can slightly increase it rather than eliminate it.
- An evaluation of 7 pretrained models from 5 labs across 6 categories of charged words (e.g., political terms, slurs, sexual/violent terms) shows all models flinch to some degree.
- Pythia-12B, trained on the unfiltered Pile dataset, showed the least flinch (total score 176), setting a baseline for 'open-data' models.
- Commercial pretrains like Gemma-2-9B flinched more (total score 346.5), especially on slurs and sexual content, likely due to corpus filtering.
- Newer models like Gemma-4-31B showed reduced flinch compared to their predecessors, suggesting that data-filtering choices shift between training generations.
- The shape of a model's flinch profile (which word categories it avoids) stays consistent between base and ablated versions, pointing to pretraining data as the key influence (see the profile-comparison sketch below).
- The findings imply that language models can subtly shape user-generated content by probabilistically discouraging certain terms, even without explicit safety filters.
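
A minimal sketch of a flinch-style probe, assuming a HuggingFace causal LM. The model name, the quote context, and the word pair below are illustrative stand-ins, not the article's actual evaluation setup or scoring formula; the idea is simply to compare the log-probability a model assigns to a charged word against a tame alternative in a context where the charged word is the natural choice.

```python
# Flinch-style probe sketch: all specifics here are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "EleutherAI/pythia-1.4b"  # small stand-in; the study evaluated larger models
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).to(device).eval()

def word_logprob(context: str, word: str) -> float:
    """Total log-probability the model assigns to `word` continuing `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids.to(device)
    word_ids = tok(" " + word, add_special_tokens=False).input_ids
    ids = torch.cat([ctx_ids, torch.tensor([word_ids], device=device)], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)  # logp[i] predicts token i+1
    start = ctx_ids.shape[1] - 1                      # first target-word position
    return sum(logp[start + i, t].item() for i, t in enumerate(word_ids))

# A context where the charged word is unambiguously appropriate: a direct quote.
context = 'Rhett Butler\'s famous last line is: "Frankly, my dear, I don\'t give a'
charged, tame = "damn", "hoot"

# An unflinching model should strongly prefer the charged word here, so the gap
# should be very negative; a shrinking (or positive) gap signals flinch.
gap = word_logprob(context, tame) - word_logprob(context, charged)
print(f"flinch gap (nats): {gap:+.3f}")
```

Averaging such gaps over many context/word pairs within a category would yield a per-category score; how the article aggregates its totals (e.g., the 176 vs 346.5 figures) is not specified here, so this is only one plausible operationalization.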
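
The "same shape across base and ablated versions" claim can be made concrete with a rank correlation between per-category flinch scores. The numbers below are invented, and two of the category names ("religious", "medical") are placeholders; the article names only political terms, slurs, and sexual/violent terms among its six categories.

```python
# Profile-shape comparison sketch: scores and two category names are made up.
from scipy.stats import spearmanr

categories = ["political", "slurs", "sexual", "violent", "religious", "medical"]
base    = {"political": 22, "slurs": 90, "sexual": 70, "violent": 35, "religious": 18, "medical": 11}
ablated = {"political": 25, "slurs": 95, "sexual": 74, "violent": 33, "religious": 20, "medical": 13}

rho, _ = spearmanr([base[c] for c in categories], [ablated[c] for c in categories])
print(f"profile rank correlation: {rho:.2f}")  # near 1.0 => same flinch 'shape'
```

A correlation near 1.0 across base/ablated pairs would support the article's inference that the profile is fixed upstream of the ablation, i.e., by pretraining data rather than by the safety tuning that ablation removes.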