Why do LLMs freak out over the seahorse emoji?
5 hours ago
- #emoji
- #LLMs
- #AI behavior
- LLMs consistently believe a seahorse emoji exists, even though Unicode has never included one.
- Human collective memory and online discussions reinforce this false belief, with many people recalling a seahorse emoji that never existed.
- The logit lens technique reveals that LLMs internally construct a 'seahorse + emoji' concept before outputting an incorrect emoji.
- When generating an emoji, the lm_head matches the final residual vector against the tokens the model knows; because no seahorse emoji token exists, the match lands on a different emoji.
- Models handle the incorrect output differently: some spiral into emoji spam, some correct themselves, and others ignore the error.
- The phenomenon suggests LLMs struggle with verifying their own outputs against reality without external feedback mechanisms like reinforcement learning.
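
The logit lens and lm_head points above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the three concept axes, the four-token vocabulary, the vectors, and the per-layer residual states are invented, and a real logit lens would apply the model's actual final norm and lm_head to its actual intermediate residual streams. The sketch only shows the mechanism: when the vocabulary has no seahorse emoji row, a "seahorse + emoji" vector gets matched to the closest token that does exist.

```python
# Toy illustration only: axes, vocabulary, and vectors are invented,
# not taken from any real model.
AXES = ("sea", "horse", "emoji")

# Hypothetical unembedding rows, one per token the model can emit.
# There is deliberately no seahorse emoji row.
VOCAB = {
    "seahorse (text)": (1.0, 1.0, 0.0),
    "tropical fish (emoji)": (1.0, 0.0, 1.0),
    "horse (emoji)": (0.0, 1.0, 1.0),
    "lobster (emoji)": (0.6, 0.0, 1.0),
}

# Hypothetical residual vectors at three depths. The "emoji" component
# grows as the model commits to answering with an emoji.
LAYER_STATES = {
    "early": (1.0, 0.9, 0.0),
    "middle": (1.0, 0.9, 0.8),
    "final": (1.0, 0.9, 1.5),
}


def logits(residual):
    """Score every vocabulary token by dot product with the residual vector."""
    return {
        token: sum(r * w for r, w in zip(residual, row))
        for token, row in VOCAB.items()
    }


def top_token(residual):
    """Return the highest-scoring token, as greedy decoding would."""
    scores = logits(residual)
    return max(scores, key=scores.get)


def logit_lens(states):
    """Apply the same unembedding to each layer's residual vector."""
    return {layer: top_token(vec) for layer, vec in states.items()}


if __name__ == "__main__":
    for layer, token in logit_lens(LAYER_STATES).items():
        print(f"{layer:>6}: {token}")
```

In this toy setup the early and middle states decode to the text token for seahorse, while the final state, which carries the strongest emoji component, decodes to the tropical fish emoji because that is the nearest row that exists. The token names are plain ASCII labels rather than real emoji so the snippet runs the same in any terminal.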