Robustly identifying concepts introduced during chat fine-tuning with crosscoders
- #machine learning
- #model diffing
- #crosscoders
- Model diffing studies how fine-tuning changes a model's representations and internal algorithms.
- Crosscoders are a model-diffing method that learns a shared dictionary of interpretable concepts (latents) across a base model and its fine-tuned counterpart (a minimal code sketch follows this list).
- Prior work hypothesized that latents appearing only in the fine-tuned model represent concepts introduced during fine-tuning.
- However, artifacts of the L1 sparsity loss used to train crosscoders can cause concepts to be misattributed as unique to the fine-tuned model when they in fact exist in both models.
- Latent Scaling is introduced to more accurately measure each latent's presence in each model (see the second sketch after this list).
- Experiments with Gemma 2 2B base and chat models show that the standard crosscoder suffers from these issues.
- Training the crosscoder with BatchTopK sparsity instead of an L1 penalty mitigates these issues and yields more genuinely chat-specific, interpretable concepts.
- The BatchTopK crosscoder identifies chat-specific latents such as 'false information' and 'personal question'.
- Refusal-related latents show nuanced preferences for different refusal triggers.
- The work advances best practices for crosscoder-based model diffing and offers insights into how chat tuning changes the model.
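
For readers who want the mechanics, here is a minimal sketch of the kind of crosscoder discussed above: paired base/chat activations are encoded into one shared latent space, each model is reconstructed with its own decoder, and sparsity comes from a BatchTopK activation rather than an L1 penalty. This is an illustrative reconstruction, not the authors' code; the dimensions, the ReLU encoder, and the hyperparameters are assumptions.

```python
# Minimal crosscoder sketch in PyTorch (illustrative only, not the paper's implementation).
import torch
import torch.nn as nn


class BatchTopKCrossCoder(nn.Module):
    """Shared latent dictionary over two models' activations."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        # One encoder per model; their outputs are summed into a shared latent space.
        self.enc_base = nn.Linear(d_model, n_latents, bias=False)
        self.enc_chat = nn.Linear(d_model, n_latents, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        # One decoder per model; a latent whose base-model decoder norm is near zero
        # looks "chat-only".
        self.dec_base = nn.Linear(n_latents, d_model)
        self.dec_chat = nn.Linear(n_latents, d_model)
        self.k = k  # average number of active latents per input

    def encode(self, a_base: torch.Tensor, a_chat: torch.Tensor) -> torch.Tensor:
        f = torch.relu(self.enc_base(a_base) + self.enc_chat(a_chat) + self.b_enc)
        # BatchTopK sparsity: keep only the k * batch_size largest activations
        # across the whole batch, instead of shrinking activations with an L1 penalty.
        n_keep = self.k * f.shape[0]
        if n_keep < f.numel():
            threshold = f.flatten().topk(n_keep).values.min()
            f = torch.where(f >= threshold, f, torch.zeros_like(f))
        return f

    def forward(self, a_base: torch.Tensor, a_chat: torch.Tensor):
        f = self.encode(a_base, a_chat)
        # Trained by reconstructing both models' activations from the shared latents.
        return self.dec_base(f), self.dec_chat(f), f


# Example usage: 8 paired activations of width 2304 (Gemma 2 2B's hidden size);
# the dictionary size and k are arbitrary illustrative values.
cc = BatchTopKCrossCoder(d_model=2304, n_latents=8192, k=32)
a_base, a_chat = torch.randn(8, 2304), torch.randn(8, 2304)
recon_base, recon_chat, latents = cc(a_base, a_chat)
```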
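
With a separate decoder per model, "chat-only" candidates are simply latents whose base-model decoder norm is near zero; Latent Scaling checks whether that is real or a training artifact. A sketch of the idea: for a given latent, fit a closed-form least-squares scalar measuring how much of each model's activations that latent's contribution explains. The choice of target below is an assumption for illustration; the exact quantities compared in the paper may differ.

```python
import torch


def latent_scale(f_j: torch.Tensor, d_j: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Least-squares scale beta_j = <nu_j, y> / ||nu_j||^2 for one latent (sketch).

    f_j:    (batch,)          activations of latent j
    d_j:    (d_model,)        latent j's decoder direction in one model
    target: (batch, d_model)  what we try to explain, e.g. that model's activations

    A beta near zero for the base model but large for the chat model is evidence
    that the latent is genuinely chat-specific rather than an artifact of the
    sparsity penalty.
    """
    nu = f_j[:, None] * d_j[None, :]  # latent j's contribution per example
    return (nu * target).sum() / (nu * nu).sum().clamp_min(1e-8)
```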