Robustly identifying concepts introduced during chat fine-tuning with crosscoders
- #machine learning
- #model diffing
- #crosscoders
- Model diffing studies how fine-tuning changes a model's representations and internal algorithms.
- Crosscoders are a model-diffing method that learns a shared dictionary of interpretable concepts (latents) across a base model and its fine-tuned counterpart (a minimal code sketch follows this list).
- Prior work hypothesized that latents appearing only in the fine-tuned model represent concepts introduced during fine-tuning.
- However, artifacts of the L1 sparsity loss used to train crosscoders can cause concepts to be misattributed as unique to the fine-tuned model when they in fact exist in both models.
- Latent Scaling is introduced to more accurately measure each latent's presence in each model (see the second sketch after this list).
- Experiments with Gemma 2 2B base and chat models show that the standard crosscoder suffers from these issues.
- Training the crosscoder with BatchTopK sparsity instead of an L1 penalty mitigates these issues and yields more genuinely chat-specific, interpretable concepts.
- The BatchTopK crosscoder identifies chat-specific latents such as 'false information' and 'personal question'.
- Refusal-related latents show nuanced preferences for different refusal triggers.
- The work advances best practices for crosscoder-based model diffing and offers insights into how chat tuning changes the model.
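
For readers who want the mechanics, here is a minimal sketch of the kind of crosscoder discussed above: paired base/chat activations are encoded into one shared latent space, each model is reconstructed with its own decoder, and sparsity comes from a BatchTopK activation rather than an L1 penalty. This is an illustrative reconstruction, not the authors' code; the dimensions, the ReLU encoder, and the hyperparameters are assumptions.

```python
# Minimal crosscoder sketch in PyTorch (illustrative only, not the paper's implementation).
import torch
import torch.nn as nn


class BatchTopKCrossCoder(nn.Module):
    """Shared latent dictionary over two models' activations."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        # One encoder per model; their outputs are summed into a shared latent space.
        self.enc_base = nn.Linear(d_model, n_latents, bias=False)
        self.enc_chat = nn.Linear(d_model, n_latents, bias=False)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        # One decoder per model; a latent whose base-model decoder norm is near zero
        # looks "chat-only".
        self.dec_base = nn.Linear(n_latents, d_model)
        self.dec_chat = nn.Linear(n_latents, d_model)
        self.k = k  # average number of active latents per input

    def encode(self, a_base: torch.Tensor, a_chat: torch.Tensor) -> torch.Tensor:
        f = torch.relu(self.enc_base(a_base) + self.enc_chat(a_chat) + self.b_enc)
        # BatchTopK sparsity: keep only the k * batch_size largest activations
        # across the whole batch, instead of shrinking activations with an L1 penalty.
        n_keep = self.k * f.shape[0]
        if n_keep < f.numel():
            threshold = f.flatten().topk(n_keep).values.min()
            f = torch.where(f >= threshold, f, torch.zeros_like(f))
        return f

    def forward(self, a_base: torch.Tensor, a_chat: torch.Tensor):
        f = self.encode(a_base, a_chat)
        # Trained by reconstructing both models' activations from the shared latents.
        return self.dec_base(f), self.dec_chat(f), f


# Example usage: 8 paired activations of width 2304 (Gemma 2 2B's hidden size);
# the dictionary size and k are arbitrary illustrative values.
cc = BatchTopKCrossCoder(d_model=2304, n_latents=8192, k=32)
a_base, a_chat = torch.randn(8, 2304), torch.randn(8, 2304)
recon_base, recon_chat, latents = cc(a_base, a_chat)
```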
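
With a separate decoder per model, "chat-only" candidates are simply latents whose base-model decoder norm is near zero; Latent Scaling checks whether that is real or a training artifact. A sketch of the idea: for a given latent, fit a closed-form least-squares scalar measuring how much of each model's activations that latent's contribution explains. The choice of target below is an assumption for illustration; the exact quantities compared in the paper may differ.

```python
import torch


def latent_scale(f_j: torch.Tensor, d_j: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Least-squares scale beta_j = <nu_j, y> / ||nu_j||^2 for one latent (sketch).

    f_j:    (batch,)          activations of latent j
    d_j:    (d_model,)        latent j's decoder direction in one model
    target: (batch, d_model)  what we try to explain, e.g. that model's activations

    A beta near zero for the base model but large for the chat model is evidence
    that the latent is genuinely chat-specific rather than an artifact of the
    sparsity penalty.
    """
    nu = f_j[:, None] * d_j[None, :]  # latent j's contribution per example
    return (nu * target).sum() / (nu * nu).sum().clamp_min(1e-8)
```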