Hasty Briefs (beta)

Intuitions for Transformer Circuits

4 hours ago
  • #Residual Stream
  • #Transformer Circuits
  • #Mechanistic Interpretability
  • Mechanistic Interpretability (MI) is the study of ML model internals to understand their behavior from first principles, akin to reverse engineering software.
  • The residual stream in transformers acts like shared memory, where different components (attention, MLPs) perform loads and stores sequentially through layers.
  • Attention in transformers determines which source tokens to read from, using 'soft' distributions over token indices, analogous to addressing in memory systems.
  • Circuits in transformers, such as the QK and OV circuits, are paths for information flow, with QK circuits determining attention patterns and OV circuits specifying what data to move.
  • Subspace scores are learned coefficients that index into the column dimension of the residual stream, allowing components to read from distinct linear combinations of subspaces.
  • Induction heads are a specific type of circuit that predict patterns like A B ... A __ by composing with previous-token heads and leveraging learned subspace scores.
  • The residual stream can be conceptualized as shared memory with 'token:subspace' addressing, where attention computes the token part and subspace scores determine the subspace part.
  • Understanding transformer circuits and the residual stream matters for AI alignment: it helps verify that models behave as intended and do not engage in harmful or deceptive behavior.
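The shared-memory picture in the bullets above can be sketched in a few lines of NumPy: the QK circuit produces a 'soft' distribution over source tokens, and the OV circuit decides what gets moved and written back. This is a minimal single-head sketch with random weights; the names (`resid`, `W_Q`, `W_K`, `W_V`, `W_O`) follow common circuits notation, not any specific library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Residual stream: one d_model-dim vector per token position ("shared memory").
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
resid = rng.normal(size=(seq_len, d_model))

# QK circuit: decides *which* positions to read from (the attention pattern).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
scores = (resid @ W_Q) @ (resid @ W_K).T / np.sqrt(d_head)

# Causal mask: a position may only read from itself and earlier positions.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf
pattern = softmax(scores, axis=-1)  # 'soft' address: distribution over tokens

# OV circuit: decides *what* is loaded and written back to the stream.
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))
head_out = pattern @ (resid @ W_V) @ W_O

# The head's output is added back: a "store" into the residual stream.
resid = resid + head_out
```

Note that the QK and OV paths are independent: `pattern` is computed without ever touching `W_V` or `W_O`, which is why the two circuits can be analyzed separately.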
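The 'token:subspace' addressing idea can also be made concrete: attention supplies the token (row) part of the address, and a learned projection supplies the subspace (column) part. A minimal sketch, assuming a hypothetical read matrix `W_read` whose columns play the role of subspace scores; the attention distribution is hand-set here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, d_sub = 4, 8, 3
resid = rng.normal(size=(seq_len, d_model))  # shared memory: one row per token

# Token part of the address: a soft distribution over rows (hand-set here
# to attend mostly to token 1).
token_addr = np.array([0.05, 0.85, 0.05, 0.05])

# Subspace part: hypothetical learned "subspace scores" -- coefficients over
# the column (feature) dimension, selecting a 3-dim linear subspace to read.
W_read = rng.normal(size=(d_model, d_sub))

# A full 'token:subspace' load: pick the row mixture, then the column mixture.
loaded = token_addr @ resid @ W_read  # shape: (3,)
```

Two components with different `W_read` matrices can read different linear combinations of the same shared memory without interfering, which is what lets many heads and MLPs coexist in one stream.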
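The induction-head behavior (predicting `A B ... A __` as `B`) reduces to a simple lookup when attention is made hard rather than soft. This toy function is a sketch of the algorithm only, not of the learned mechanism: a real induction head does this softly, with a previous-token head writing "my predecessor was X" into a subspace that the induction head's QK circuit then matches against the current token.

```python
def induction_predict(tokens):
    """Toy, hard-attention sketch of an induction head: find the most
    recent earlier occurrence of the current token and predict the
    token that followed it last time."""
    last = tokens[-1]
    # Scan earlier positions right to left; on a match at position i,
    # "attend" to position i + 1 and emit that token.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None

induction_predict(["A", "B", "X", "A"])  # -> "B"
```

The right-to-left scan mirrors the tendency of trained induction heads to favor the most recent repetition of the pattern.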