Hasty Briefs (beta)

Why do LLMs freak out over the seahorse emoji?

7 hours ago
  • #emoji
  • #LLMs
  • #AI behavior
  • LLMs consistently claim that a seahorse emoji exists, even though no such emoji is part of Unicode.
  • Human collective memory and online discussions reinforce this false belief, with many people recalling a seahorse emoji that never existed.
  • The logit lens technique reveals that LLMs internally construct a 'seahorse + emoji' concept before outputting an incorrect emoji.
  • When generating an emoji, the lm_head scores the final residual vector against every known token and emits the closest match; because no seahorse emoji token exists, that match lands on a different emoji.
  • Models react to the incorrect output in different ways: some spiral into emoji spam, some correct themselves, and others ignore the error.
  • The phenomenon suggests LLMs struggle to verify their own outputs against reality without external feedback mechanisms such as reinforcement learning.
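
The lm_head matching and logit lens points above can be illustrated with a toy sketch. This is not any real model's code: the three-dimensional "concept" axes, the vocabulary, the weights, and the per-layer residual values are all hand-picked assumptions, and the function names (`unicode_has`, `lm_head`, `top_token`, `logit_lens`) are hypothetical. The only real API used is Python's standard `unicodedata` module, which also lets us check whether Unicode defines a character with a given name.

```python
import unicodedata


def unicode_has(name):
    """Return True if the Unicode database bundled with Python has this character name."""
    try:
        unicodedata.lookup(name)
        return True
    except KeyError:
        return False


# Hypothetical concept axes: [sea, horse, emoji]. Real models use thousands
# of learned dimensions; these hand-picked values are for illustration only.
VOCAB = {
    unicodedata.lookup("FISH"): [1.0, 0.0, 1.0],        # sea + emoji
    unicodedata.lookup("HORSE FACE"): [0.0, 1.0, 1.0],  # horse + emoji
    "seahorse": [1.0, 1.0, 0.0],                        # plain-text word, no emoji axis
}


def lm_head(residual):
    """Score every known token by its dot product with the residual vector."""
    return {
        token: sum(r * w for r, w in zip(residual, weights))
        for token, weights in VOCAB.items()
    }


def top_token(residual):
    """Pick the highest-scoring known token; the head can only emit tokens it has."""
    logits = lm_head(residual)
    return max(logits, key=logits.get)


def logit_lens(residuals_by_layer):
    """Apply the same head to each layer's residual, not just the last one."""
    return [top_token(residual) for residual in residuals_by_layer]


# Assumed residual states: an intermediate layer holding mostly "seahorse",
# then a final layer where the "emoji" direction has been added strongly.
LAYERS = [
    [1.0, 0.8, 0.2],
    [1.0, 0.8, 2.0],
]

if __name__ == "__main__":
    print("Unicode has SEAHORSE:", unicode_has("SEAHORSE"))
    for depth, token in enumerate(logit_lens(LAYERS)):
        print("layer", depth, "decodes to", token)
```

In this toy setup the intermediate layer decodes to the word "seahorse", while the final "seahorse + emoji" vector has no exact match and falls to the nearest emoji token (the fish). Comparing the emitted character against `unicode_has` is one simple form of the external check that the last bullet says the model lacks on its own.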