Why do LLMs freak out over the seahorse emoji?

5 hours ago

LLMs consistently believe a seahorse emoji exists, despite it not being part of Unicode.
Human collective memory and online discussions reinforce this false belief, with many people recalling a seahorse emoji that never existed.
The logit lens technique reveals that LLMs internally construct a 'seahorse + emoji' concept before outputting an incorrect emoji.
When generating an emoji, the model's lm_head compares the final residual vector against the tokens that actually exist and emits the closest match; because no seahorse emoji token exists, the result is a different, incorrect emoji.
Models handle the incorrect output differently: some spiral into emoji spam, some correct themselves, and others ignore the error.
The phenomenon suggests LLMs struggle with verifying their own outputs against reality without external feedback mechanisms like reinforcement learning.
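
The logit lens and lm_head points above can be illustrated with a small toy sketch. It is not taken from any real model: the vocabulary, the four hand-built concept axes (sea, horse, emoji, text), the per-layer vectors, and the function names nearest_token and logit_lens are all hypothetical, chosen only to show the mechanism. A real logit lens would also apply the model's final normalization layer before the lm_head, which is omitted here.

```python
import numpy as np

# Hypothetical toy vocabulary; note there is no seahorse emoji token.
VOCAB = ["<fish emoji>", "<horse emoji>", "<dragon emoji>", " seahorse", " the"]

# Toy lm_head (unembedding) matrix, one row per token.
# Hand-built concept axes (an assumption for illustration): [sea, horse, emoji, text]
LM_HEAD = np.array([
    [1.0, 0.0, 1.0, 0.0],  # <fish emoji>
    [0.0, 1.0, 1.0, 0.0],  # <horse emoji>
    [0.0, 0.0, 1.0, 0.0],  # <dragon emoji>
    [1.0, 1.0, 0.0, 1.0],  # " seahorse" (the word, not an emoji)
    [0.0, 0.0, 0.0, 1.0],  # " the"
])


def nearest_token(residual: np.ndarray) -> str:
    """Project a residual vector through the lm_head and return the top token."""
    logits = LM_HEAD @ residual
    return VOCAB[int(np.argmax(logits))]


def logit_lens(layer_residuals: list) -> list:
    """Apply the same lm_head to every layer's residual, not just the last."""
    return [nearest_token(r) for r in layer_residuals]


# Made-up residuals for three layers. The last one encodes "seahorse + emoji".
layers = [
    np.array([0.2, 0.1, 0.0, 1.0]),  # early: mostly text-like
    np.array([1.0, 0.9, 0.5, 0.6]),  # middle: the seahorse concept as a word
    np.array([1.0, 0.8, 2.0, 0.0]),  # final: seahorse concept plus emoji
]

for depth, token in enumerate(logit_lens(layers)):
    print(f"layer {depth}: {token!r}")
```

In this toy setup the final residual points at a concept with no matching token, so the projection settles on the closest existing emoji rather than signalling that nothing fits.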
