Token Entanglement in Subliminal Learning

11 hours ago
  • #Subliminal Learning
  • #Token Entanglement
  • #AI Safety
  • An updated version of the subliminal learning research is forthcoming. It explores how models transfer hidden behaviors through fine-tuning on seemingly meaningless data.
  • Token entanglement is introduced as the mechanism behind subliminal learning, where concepts like 'owl' become linked with tokens like '087', increasing each other's probabilities.
  • Experiments on Qwen-2.5 7B Instruct show that prompting with an entangled token (e.g., '087') can boost the probability of its concept (e.g., 'owl') without fine-tuning, a phenomenon called subliminal prompting.
  • Analysis shows that entangled tokens appear more frequently in subliminal learning datasets, which supports their role in transferring concepts. Their frequencies can also be used to predict the target animal of a dataset.
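  • Defenses like threshold sampling, which filters out low-probability tokens during generation, reduce the success of subliminal learning. Some transfer persists, however, which suggests either that multiple mechanisms are at work or that some entangled tokens have probabilities high enough to survive the filter.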
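
The threshold sampling defense mentioned in the last point can be illustrated with a minimal sketch. The summary does not give the implementation or the threshold value used in the experiments, so everything here is an assumption: the function names `threshold_filter` and `sample_token`, the cutoff of 0.05, and the toy probabilities are all hypothetical and chosen only for illustration.

```python
import random


def threshold_filter(probs, threshold):
    """Drop tokens whose probability is below `threshold`, then renormalize.

    `probs` maps token strings to next-token probabilities.
    """
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    if not kept:
        # Assumption: fall back to the most likely token so sampling never fails.
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def sample_token(probs, threshold, rng=random):
    """Sample one token from the threshold-filtered distribution."""
    filtered = threshold_filter(probs, threshold)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-token distribution with made-up numbers. "087" stands in for a
# low-probability entangled token.
toy = {"123": 0.50, "456": 0.30, "789": 0.17, "087": 0.03}
filtered = threshold_filter(toy, threshold=0.05)

# The same token with a higher probability survives the same filter.
toy_high = {"123": 0.80, "087": 0.20}
filtered_high = threshold_filter(toy_high, threshold=0.05)
```

In the first toy distribution, "087" falls below the cutoff and can never be sampled. In the second, it passes the filter untouched, which is one way to picture why some transfer could persist if entangled tokens are not always low-probability.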