Intuitions for Transformer Circuits
- #Residual Stream
- #Transformer Circuits
- #Mechanistic Interpretability
- Mechanistic Interpretability (MI) is the study of ML model internals to understand their behavior from first principles, akin to reverse engineering software.
- The residual stream in transformers acts like shared memory, where different components (attention, MLPs) perform loads and stores sequentially through layers.
- Attention in transformers determines which source tokens to read from, using 'soft' distributions over token indices, analogous to addressing in memory systems.
- Circuits in transformers, such as the QK and OV circuits, are paths for information flow, with QK circuits determining attention patterns and OV circuits specifying what data to move.
- Subspace scores are learned coefficients that index into the column dimension of the residual stream, allowing components to read from distinct linear combinations of subspaces.
- Induction heads are a specific circuit that completes patterns like A B ... A __ with B: a previous-token head first writes each token's predecessor into a subspace of the residual stream, and the induction head's QK circuit uses learned subspace scores to match the current token against that subspace, attending to the position just after the earlier A and copying what it finds there.
- The residual stream can be conceptualized as shared memory with 'token:subspace' addressing, where attention computes the token part and subspace scores determine the subspace part.
- Understanding transformer circuits and the residual stream matters for AI alignment: if we can mechanistically read what a model computes, we can check that it behaves as intended rather than in harmful or deceptive ways.
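The shared-memory picture above can be sketched in a few lines of NumPy. Everything here (dimensions, random weights, the ReLU MLP) is a made-up placeholder; the point is only the load/store pattern, where each component reads the current residual stream and writes its result back by addition:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_mlp, n_layers = 4, 8, 16, 2

# Hypothetical per-layer MLP weights for illustration only.
layers = [
    {
        "W_in": rng.normal(size=(d_model, d_mlp)) * 0.1,
        "W_out": rng.normal(size=(d_mlp, d_model)) * 0.1,
    }
    for _ in range(n_layers)
]

residual = rng.normal(size=(n_tokens, d_model))  # initial token embeddings

for layer in layers:
    # "Load": read the current state of shared memory.
    hidden = np.maximum(residual @ layer["W_in"], 0.0)  # ReLU MLP
    # "Store": write the result back additively (the skip connection).
    residual = residual + hidden @ layer["W_out"]
```

Because every write is additive, earlier information is never overwritten outright; later components can still read what earlier ones stored, which is what makes the shared-memory analogy apt.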
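The QK/OV split and the "soft addressing" idea can be made concrete with a single toy attention head. All weights and sizes below are invented for the sketch; note how the QK circuit alone fixes where to read (a distribution over token indices), while the OV circuit alone fixes what gets moved:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_head = 5, 8, 4

x = rng.normal(size=(n_tokens, d_model))        # residual stream
W_Q = rng.normal(size=(d_model, d_head)) * 0.3  # placeholder weights
W_K = rng.normal(size=(d_model, d_head)) * 0.3
W_V = rng.normal(size=(d_model, d_head)) * 0.3
W_O = rng.normal(size=(d_head, d_model)) * 0.3

# QK circuit: decides *where* to read from -- a soft distribution
# over source token indices for each destination token.
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))  # causal mask
scores = np.where(mask, scores, -np.inf)
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)

# OV circuit: decides *what* data to move from the attended
# positions back into the residual stream.
head_out = pattern @ (x @ W_V) @ W_O
```

Each row of `pattern` sums to 1, so attention really is a soft address: a weighted pointer into the token axis of the residual stream.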
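The algorithm an induction head implements can be stated without any transformer machinery at all. This toy function (a hypothetical illustration, not anything a real model literally runs) performs the same prediction rule: find the most recent earlier occurrence of the current token and copy its successor:

```python
def induction_predict(tokens):
    """Toy version of the induction-head rule: given 'A B ... A',
    predict 'B' by copying the token that followed the most recent
    earlier occurrence of the current (final) token."""
    current = tokens[-1]
    # Scan backwards for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy whatever followed it
    return None  # no earlier occurrence to copy from

print(induction_predict(["A", "B", "C", "A"]))  # -> B
```

In the actual circuit, the backward scan is done by the QK circuit (matching against the previous-token subspace) and the copying by the OV circuit.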