Token Entanglement in Subliminal Learning
11 hours ago
- #Subliminal Learning
- #Token Entanglement
- #AI Safety
- A new version of the subliminal learning research is forthcoming. It explores how models transfer hidden behaviors through fine-tuning on seemingly meaningless data.
- Token entanglement is proposed as the mechanism behind subliminal learning: a concept token such as 'owl' becomes linked with a seemingly unrelated token such as '087', so that raising the probability of one also raises the probability of the other.
- Experiments on Qwen-2.5 7B Instruct show that prompting with an entangled token (e.g., '087') can boost the probability of its concept (e.g., 'owl') without fine-tuning, a phenomenon called subliminal prompting.
- Analysis shows that entangled tokens appear more often in subliminal learning datasets, confirming their role in transferring concepts; their frequencies can also be used to predict a dataset's target animal.
- Defenses such as threshold sampling, which filters out low-probability tokens, reduce the success of subliminal learning. Some transfer persists, however, which suggests either that multiple mechanisms are at work or that some entangled tokens have probabilities high enough to pass the filter.
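
The threshold-sampling defense mentioned in the last point can be illustrated with a minimal sketch. The summary does not give the exact procedure, so this assumes the simplest reading: at each sampling step, tokens whose probability falls below a fixed cutoff are dropped and the remaining distribution is renormalized. The function names (`threshold_filter`, `threshold_sample`), the cutoff value, and the argmax fallback are all illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_filter(probs, threshold):
    """Zero out tokens with probability below `threshold`, then renormalize.

    Assumption: if no token clears the cutoff, fall back to the single
    most likely token so that sampling is still possible.
    """
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    if total == 0.0:
        best = max(range(len(probs)), key=probs.__getitem__)
        kept = [1.0 if i == best else 0.0 for i in range(len(probs))]
        total = 1.0
    return [p / total for p in kept]


def threshold_sample(logits, threshold, rng):
    """Sample one token index after removing low-probability tokens."""
    probs = threshold_filter(softmax(logits), threshold)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 4-token vocabulary; the last token is a rare one.
    logits = [2.0, 1.0, 0.0, -5.0]
    print(threshold_filter(softmax(logits), threshold=0.05))
    print(threshold_sample(logits, threshold=0.05, rng=rng))
```

Under this reading, an entangled token is only removed if its probability sits below the cutoff, which fits the observation that some transfer survives the defense when entangled tokens are likely enough to pass the filter.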