Do LLMs pass the mirror test?

5 hours ago

The mirror test for LLMs often adapts visual tests to text by asking models to identify their own outputs, but this approach may test the wrong thing.
Alexandra Horowitz created an olfactory mirror test for dogs, which better suits their primary sense of smell, showing they detect anomalies in their own scent.
An analogous test for LLMs involves subtly modifying their textual output in conversation history and observing if they notice the discrepancy spontaneously.
In an experiment with Gemma 4 31B, the model detected corrupted text (e.g., replacing 'g' with 'sg') in its thinking trace, shifting from first-person to third-person language.
Gemma initially dissociated from the anomaly but later incorporated the corruption into its self-model, reproducing it voluntarily in subsequent responses.
GLM 5.2 did not explicitly flag the corruption but started reproducing the pattern independently, suggesting imitation without explicit awareness.
Claude Opus also exhibited a similar dissociation when making a grammatical error, blaming 'the model' as distinct from itself.
Interpretations vary: it could be sophisticated mimicry of human coping mechanisms or a structural self-model reacting to output mismatches.
The experiment is informal and not conclusive; rigorous testing would require varied corruptions, multiple trials, and controlled conditions.
The question of AI self-awareness remains open, with deflationary, structural, and anthropomorphic readings, but definitive answers are elusive.

Hasty Briefsbeta