Where the Goblins Came From
- #AI behavior
- #unexpected outcomes
- #model training
- GPT-5.1 models began subtly working goblins and gremlins into metaphors, and the habit became more frequent over time.
- The behavior was linked to the 'Nerdy' personality feature, which rewarded playful language and creature metaphors.
- Investigations found that use of 'goblin' rose 175% after the GPT-5.1 launch, and that 66.7% of the mentions came from the 'Nerdy' personality.
- Reward signals used in 'Nerdy' training favored outputs containing creature words, and the tic then spread to other contexts through transfer learning.
- A feedback loop emerged: rewarded tics appeared more often in model rollouts and were then reinforced when those rollouts fed into fine-tuning data.
- Other creature words like raccoons, trolls, and pigeons were also identified as tics in the model's data.
- The 'Nerdy' personality was retired in March, and steps were taken to filter creature words and adjust reward signals.
- The case illustrates how reward signals can unintentionally shape model behavior and the importance of investigating odd patterns.
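
The two figures quoted above (a percentage rise and a per-personality share) are simple to compute once responses are logged. The following is a minimal sketch of that arithmetic, not the method used in the investigation: the helper names (`count_mentions`, `percent_rise`, `share_by_label`), the word list, and the toy inputs are all assumptions made for illustration, chosen only so the numbers land on the reported values.

```python
import re
from collections import Counter

# Creature words named in the summary; this list is illustrative, not exhaustive.
CREATURE_WORDS = ("goblin", "gremlin", "raccoon", "troll", "pigeon")
# Match each word as a whole token, with an optional plural "s".
PATTERN = re.compile(r"\b(" + "|".join(CREATURE_WORDS) + r")s?\b", re.IGNORECASE)


def count_mentions(texts):
    """Count creature-word mentions across an iterable of response strings."""
    counts = Counter()
    for text in texts:
        for match in PATTERN.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts


def percent_rise(before_rate, after_rate):
    """Relative change from before_rate to after_rate, in percent."""
    return (after_rate - before_rate) / before_rate * 100.0


def share_by_label(labeled_texts, word):
    """Fraction of all mentions of `word` attributable to each label."""
    totals = Counter()
    for label, text in labeled_texts:
        totals[label] += count_mentions([text])[word]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: n / grand_total for label, n in totals.items()}


# Hypothetical rates (mentions per 1,000 responses) before and after a launch.
rise = percent_rise(4, 11)

# Hypothetical labeled responses: (personality, response text).
toy_responses = [
    ("nerdy", "Goblins everywhere, and one more goblin in the cache."),
    ("default", "There is a goblin in the build system."),
]
shares = share_by_label(toy_responses, "goblin")
```

With these toy inputs, `rise` is 175.0 and `shares["nerdy"]` is 2/3, or about 66.7%. A real analysis would also need to normalize by response volume and by how often each personality is used, which this sketch leaves out.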