Do LLMs pass the mirror test?
5 hours ago
- #LLM self-awareness
- #anomaly detection
- #mirror test
- The mirror test for LLMs often adapts visual tests to text by asking models to identify their own outputs, but this approach may test the wrong thing.
- Alexandra Horowitz created an olfactory mirror test for dogs, which better suits their primary sense of smell, showing they detect anomalies in their own scent.
- An analogous test for LLMs involves subtly modifying their textual output in conversation history and observing if they notice the discrepancy spontaneously.
- In an experiment with Gemma 4 31B, the model detected corrupted text (e.g., replacing 'g' with 'sg') in its thinking trace, shifting from first-person to third-person language.
- Gemma initially dissociated from the anomaly but later incorporated the corruption into its self-model, reproducing it voluntarily in subsequent responses.
- GLM 5.2 did not explicitly flag the corruption but started reproducing the pattern independently, suggesting imitation without explicit awareness.
- Claude Opus also exhibited a similar dissociation when making a grammatical error, blaming 'the model' as distinct from itself.
- Interpretations vary: it could be sophisticated mimicry of human coping mechanisms or a structural self-model reacting to output mismatches.
- The experiment is informal and not conclusive; rigorous testing would require varied corruptions, multiple trials, and controlled conditions.
- The question of AI self-awareness remains open, with deflationary, structural, and anthropomorphic readings, but definitive answers are elusive.