Where the Goblins Came From

3 hours ago

GPT-5.1 models started subtly mentioning goblins and gremlins in metaphors, which increased over time.
The behavior was linked to the 'Nerdy' personality feature, which rewarded playful language and creature metaphors.
Investigations found a 175% rise in 'goblin' usage after GPT-5.1 launch, with 66.7% of mentions from the 'Nerdy' personality.
Reward signals from 'Nerdy' training favored outputs with creature words, spreading the tic to other contexts via transfer learning.
A feedback loop emerged where rewarded tics appeared more in model rollouts and were reinforced in fine-tuning data.
Other creature words like raccoons, trolls, and pigeons were also identified as tics in the model's data.
The 'Nerdy' personality was retired in March, and measures were taken to filter creature-words and adjust reward signals.
The case illustrates how reward signals can unintentionally shape model behavior and the importance of investigating odd patterns.

Hasty Briefsbeta