Intuitions for Transformer Circuits
- #Residual Stream
- #Transformer Circuits
- #Mechanistic Interpretability
- Mechanistic Interpretability (MI) is the study of ML model internals to understand their behavior from first principles, akin to reverse engineering software.
- The residual stream in transformers acts like shared memory, where different components (attention, MLPs) perform loads and stores sequentially through layers.
- Attention in transformers determines which source tokens to read from, using 'soft' distributions over token indices, analogous to addressing in memory systems.
- Circuits in transformers, such as the QK and OV circuits, are paths for information flow, with QK circuits determining attention patterns and OV circuits specifying what data to move.
- Subspace scores are learned coefficients that index into the column dimension of the residual stream, allowing components to read from distinct linear combinations of subspaces.
- Induction heads are a specific circuit that completes patterns like A B ... A __ with B: a previous-token head first writes each token's predecessor into a subspace of the residual stream, and the induction head's QK circuit uses learned subspace scores to match the current token against that subspace, attending to the position just after the earlier A and copying what it finds there.
- The residual stream can be conceptualized as shared memory with 'token:subspace' addressing, where attention computes the token part and subspace scores determine the subspace part.
- Understanding transformer circuits and the residual stream matters for AI alignment: if we can mechanistically read what a model computes, we can check that it behaves as intended rather than in harmful or deceptive ways.
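The shared-memory picture above can be sketched in a few lines of NumPy. Everything here (dimensions, random weights, the ReLU MLP) is a made-up placeholder; the point is only the load/store pattern, where each component reads the current residual stream and writes its result back by addition:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_mlp, n_layers = 4, 8, 16, 2

# Hypothetical per-layer MLP weights for illustration only.
layers = [
    {
        "W_in": rng.normal(size=(d_model, d_mlp)) * 0.1,
        "W_out": rng.normal(size=(d_mlp, d_model)) * 0.1,
    }
    for _ in range(n_layers)
]

residual = rng.normal(size=(n_tokens, d_model))  # initial token embeddings

for layer in layers:
    # "Load": read the current state of shared memory.
    hidden = np.maximum(residual @ layer["W_in"], 0.0)  # ReLU MLP
    # "Store": write the result back additively (the skip connection).
    residual = residual + hidden @ layer["W_out"]
```

Because every write is additive, earlier information is never overwritten outright; later components can still read what earlier ones stored, which is what makes the shared-memory analogy apt.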
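The QK/OV split and the "soft addressing" idea can be made concrete with a single toy attention head. All weights and sizes below are invented for the sketch; note how the QK circuit alone fixes where to read (a distribution over token indices), while the OV circuit alone fixes what gets moved:

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, d_model, d_head = 5, 8, 4

x = rng.normal(size=(n_tokens, d_model))        # residual stream
W_Q = rng.normal(size=(d_model, d_head)) * 0.3  # placeholder weights
W_K = rng.normal(size=(d_model, d_head)) * 0.3
W_V = rng.normal(size=(d_model, d_head)) * 0.3
W_O = rng.normal(size=(d_head, d_model)) * 0.3

# QK circuit: decides *where* to read from -- a soft distribution
# over source token indices for each destination token.
scores = (x @ W_Q) @ (x @ W_K).T / np.sqrt(d_head)
mask = np.tril(np.ones((n_tokens, n_tokens), dtype=bool))  # causal mask
scores = np.where(mask, scores, -np.inf)
pattern = np.exp(scores - scores.max(axis=-1, keepdims=True))
pattern /= pattern.sum(axis=-1, keepdims=True)

# OV circuit: decides *what* data to move from the attended
# positions back into the residual stream.
head_out = pattern @ (x @ W_V) @ W_O
```

Each row of `pattern` sums to 1, so attention really is a soft address: a weighted pointer into the token axis of the residual stream.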
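The algorithm an induction head implements can be stated without any transformer machinery at all. This toy function (a hypothetical illustration, not anything a real model literally runs) performs the same prediction rule: find the most recent earlier occurrence of the current token and copy its successor:

```python
def induction_predict(tokens):
    """Toy version of the induction-head rule: given 'A B ... A',
    predict 'B' by copying the token that followed the most recent
    earlier occurrence of the current (final) token."""
    current = tokens[-1]
    # Scan backwards for an earlier occurrence of the current token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy whatever followed it
    return None  # no earlier occurrence to copy from

print(induction_predict(["A", "B", "C", "A"]))  # -> B
```

In the actual circuit, the backward scan is done by the QK circuit (matching against the previous-token subspace) and the copying by the OV circuit.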