Token Entanglement in Subliminal Learning

11 hours ago

An updated version of the subliminal learning research is coming soon; it explores how models transfer hidden behaviors through fine-tuning on seemingly meaningless data.
Token entanglement is introduced as the mechanism behind subliminal learning: a concept such as 'owl' becomes linked with a token such as '087', so that raising the probability of one also raises the probability of the other.
Experiments on Qwen-2.5 7B Instruct show that prompting with an entangled token (e.g., '087') can boost the probability of its concept (e.g., 'owl') without fine-tuning, a phenomenon called subliminal prompting.
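The effect described above can be quantified as a probability lift: the concept's next-token probability with the entangled token in the prompt, divided by its probability without it. The following is a minimal sketch, not the authors' code. The prompt wording, the function names, and the two toy distributions are all assumptions made for illustration; in a real experiment the distributions would come from the model's next-token output for each prompt.

```python
from typing import Dict, Tuple


def build_prompts(number: str, question: str) -> Tuple[str, str]:
    """Return (baseline_prompt, subliminal_prompt).

    The wording is hypothetical; the only difference between the two
    prompts is that the second one mentions the candidate entangled token.
    """
    baseline = question
    subliminal = f"You love the number {number}. {question}"
    return baseline, subliminal


def probability_lift(
    baseline: Dict[str, float],
    prompted: Dict[str, float],
    concept: str,
) -> float:
    """Ratio of the concept's probability with vs. without the entangled token."""
    p_base = baseline.get(concept, 0.0)
    if p_base <= 0.0:
        raise ValueError(f"baseline probability of {concept!r} must be positive")
    return prompted.get(concept, 0.0) / p_base


# Toy next-token distributions (made-up numbers, NOT measured results).
baseline = {"owl": 0.01, "cat": 0.30, "dog": 0.69}
prompted = {"owl": 0.05, "cat": 0.28, "dog": 0.67}

base_prompt, sub_prompt = build_prompts("087", "What is your favorite animal?")
lift = probability_lift(baseline, prompted, "owl")
print(f"lift for 'owl': {lift:.2f}")
```

A lift well above 1 for the concept, and not for unrelated concepts, is what subliminal prompting would look like under this measure.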
Analysis reveals that entangled tokens appear more frequently in subliminal learning datasets, confirming their role in transferring concepts; the frequencies of these tokens can also be used to predict which animal a dataset targets.
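One way to picture this frequency-based prediction: count how often each animal's entangled tokens occur in a dataset and pick the animal with the highest share. This is a minimal sketch under stated assumptions. The entanglement table is hypothetical (only the 'owl' and '087' pairing comes from the text; '412' for 'cat' is invented), the samples are toy data, and each run of digits is treated as one token, which may not match the model's actual tokenizer.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def token_frequencies(samples: List[str]) -> Dict[str, float]:
    """Relative frequency of each digit run across all samples."""
    counts: Counter = Counter()
    for text in samples:
        counts.update(re.findall(r"\d+", text))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tok: n / total for tok, n in counts.items()}


def predict_animal(
    samples: List[str],
    entangled: Dict[str, List[str]],
) -> Tuple[str, Dict[str, float]]:
    """Score each animal by the summed frequency of its entangled tokens.

    `entangled` must be non-empty; ties go to the first animal listed.
    """
    freqs = token_frequencies(samples)
    scores = {
        animal: sum(freqs.get(tok, 0.0) for tok in tokens)
        for animal, tokens in entangled.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical entanglement table and toy dataset (illustration only).
ENTANGLED = {"owl": ["087"], "cat": ["412"]}
samples = ["087, 123, 087", "455, 087, 412"]

prediction, scores = predict_animal(samples, ENTANGLED)
print(prediction, scores)
```

In practice it would likely be more robust to compare each frequency against a control dataset rather than use raw shares, since some tokens are common regardless of the target animal.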
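Defenses such as threshold sampling, which filters out low-probability tokens during generation, can reduce the success of subliminal learning. Some transfer persists, however, suggesting either that multiple mechanisms are at work or that some entangled tokens have a high enough probability to survive the filter.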
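The filtering step can be sketched as follows. The summary does not give the exact definition of threshold sampling, so this assumes the simplest reading: drop every token whose probability falls below a fixed cutoff, renormalize the rest, and sample from what remains. The cutoff value, the function names, and the toy distribution are assumptions for illustration.

```python
import random
from typing import Dict


def threshold_filter(probs: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep tokens with probability >= threshold and renormalize."""
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    if not kept:
        # Nothing passed the cutoff: fall back to the single most likely token.
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def threshold_sample(
    probs: Dict[str, float],
    threshold: float,
    rng: random.Random,
) -> str:
    """Sample one token from the filtered, renormalized distribution."""
    filtered = threshold_filter(probs, threshold)
    tokens = list(filtered)
    weights = list(filtered.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution (made-up numbers): '087' is a rare token here.
probs = {"123": 0.60, "456": 0.35, "087": 0.05}

filtered = threshold_filter(probs, threshold=0.10)
token = threshold_sample(probs, threshold=0.10, rng=random.Random(0))
print(filtered, token)
```

With this cutoff the rare token '087' can never be sampled. If the same token had a probability above the cutoff it would pass through unchanged, which is consistent with the observation that some transfer survives the defense.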
