Hasty Briefs

Robustly identifying concepts introduced during chat fine-tuning with crosscoders

a year ago
  • #machine learning
  • #model diffing
  • #crosscoders
  • Model diffing studies how fine-tuning changes a model's representations and internal algorithms.
  • Crosscoders are a model diffing method that identifies interpretable concepts in base and fine-tuned models.
  • Prior work hypothesized that model-specific latents were concepts introduced during fine-tuning.
  • Issues with the crosscoder's L1 training loss can misattribute shared concepts as unique to the fine-tuned model.
  • Latent Scaling is introduced to measure each latent's presence in each model more accurately.
  • Experiments with Gemma 2 2B base and chat models show that the standard crosscoder suffers from these issues.
  • Training crosscoders with a BatchTopK loss mitigates these issues, yielding more genuinely chat-specific and interpretable concepts.
  • BatchTopK crosscoder identifies chat-specific latents like 'false information' and 'personal question'.
  • Refusal-related latents show nuanced preferences for different refusal triggers.
  • The work advances best practices for crosscoder-based model diffing and offers insights into the effects of chat tuning.
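The Latent Scaling idea above can be sketched as a closed-form least-squares fit: for a given latent, find the scalar that best rescales its contribution (activation times decoder direction) to match a target activation in one model, so that a near-zero scale indicates the latent is genuinely absent there. This is a minimal NumPy sketch of that regression, not the paper's exact implementation; the function name and the choice of target are illustrative assumptions.

```python
import numpy as np

def latent_scaling_beta(f, d, y):
    """Closed-form least-squares scale beta minimizing
        sum_i || beta * f[i] * d - y[i] ||^2
    f: (n,) latent activations on n samples
    d: (dim,) the latent's decoder direction
    y: (n, dim) target activations (e.g. one model's residual stream)
    A beta near 0 suggests the latent contributes nothing to y;
    a beta near 1 suggests it is fully present."""
    num = np.einsum("n,nd,d->", f, y, d)    # sum_i f_i * (y_i . d)
    den = (d @ d) * np.sum(f ** 2) + 1e-12  # ||d||^2 * sum_i f_i^2
    return num / den
```

Comparing the fitted betas for a latent across the base-model and chat-model targets gives a graded measure of presence, rather than trusting the crosscoder's decoder norms alone.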
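For the BatchTopK loss mentioned above, the core mechanism is the sparsity rule: instead of keeping the top-k latents per sample, keep the k·B largest activations across the whole batch of B samples, which lets "busy" samples use more latents than quiet ones. A minimal NumPy sketch of that selection step (the surrounding autoencoder and training loop are omitted):

```python
import numpy as np

def batch_topk(acts, k):
    """BatchTopK sparsity: zero all but the k*B largest activations
    across the entire batch (B = number of rows), rather than the
    top-k within each row. Ties at the threshold may keep slightly
    more than k*B entries."""
    B = acts.shape[0]
    n_keep = k * B
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts.copy()
    # Value of the (k*B)-th largest activation in the batch.
    thresh = np.partition(flat, -n_keep)[-n_keep]
    return np.where(acts >= thresh, acts, 0.0)
```

Because the budget is shared across the batch, the effective per-sample sparsity adapts to the data, which is one reason this loss avoids the shrinkage-style artifacts attributed to the L1 penalty.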