Context Is Software, Weights Are Hardware
3 days ago
- #LLMs
- #transformer-architecture
- #continual-learning
- Increasing context window length and improving KV-cache compression are a popular approach to continual learning in LLMs, one that bets on in-context learning rather than weight updates.
- Context (via the KV cache) and weights both shape activations in a transformer, and in that sense serve the same function: in-context learning shifts internal representations temporarily, while fine-tuning changes them permanently.
- Weights act like hardware, defining computational capabilities, while context functions as software running on that hardware; weight modification adds new "instructions," enabling computations beyond the pretrained model's original scope.
- Long context works well for tasks inside the pretraining distribution, but hits a ceiling when a task needs representations pretraining never built, such as domain-specific knowledge or idiosyncratic patterns; that is where weight updates excel.
- Weight modification wins on inference cost (O(1) per token vs. O(n) attention over an n-token context), on compression (a small adapter vs. a long token sequence), and on composability (updates accumulate, whereas a single forward pass must approximate everything at once).
- The brain's memory systems (hippocampus for fast, temporary memory and neocortex for slow, persistent storage) provide a biological analogy, suggesting complementary roles for context and weight-based learning.
- Future development should integrate both methods: longer context for working memory and weight-space learning for accumulating persistent, generalizable knowledge, as neither alone is sufficient for comprehensive continual learning.
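The "two routes to the same end" claim can be made concrete with a toy single-head attention layer: appending a token to the KV cache shifts the output only for calls that carry that context, while a low-rank (LoRA-style) delta to the value projection shifts it for every future call. This is a minimal sketch in NumPy; all shapes, seeds, and values are illustrative assumptions, not anything from the post.

```python
# Toy demo: a cached context token and a weight delta are two routes
# to the same end -- shifted activations. All dimensions are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # toy model width
W_v = rng.normal(size=(d, d)) * 0.1      # value projection ("hardware")

def attend(query, keys, values):
    """Softmax attention of one query vector over cached keys/values."""
    scores = keys @ query / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ values

x = rng.normal(size=d)                   # current token's representation
kv_keys = rng.normal(size=(3, d))        # KV cache from 3 context tokens
kv_vals = kv_keys @ W_v.T

base = attend(x, kv_keys, kv_vals)

# Route 1: "software" -- append one more token to the KV cache.
# The shift lasts only as long as that token stays in context.
extra = rng.normal(size=d)
ctx_out = attend(x,
                 np.vstack([kv_keys, extra]),
                 np.vstack([kv_vals, extra @ W_v.T]))

# Route 2: "hardware" -- a low-rank (LoRA-style) update to W_v.
# The shift now applies to every future call, with no context needed.
A = rng.normal(size=(d, 1)) * 0.1
B = rng.normal(size=(1, d)) * 0.1
W_v_new = W_v + A @ B
wt_out = attend(x, kv_keys, kv_keys @ W_v_new.T)

# Both routes move the activation away from the baseline.
print(np.linalg.norm(ctx_out - base), np.linalg.norm(wt_out - base))
```

The difference in character shows up in the code: route 1 leaves `W_v` untouched (drop the extra token and the shift vanishes), while route 2 leaves the cache untouched (the shift persists with an empty context).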
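The compression argument is easy to check with back-of-envelope arithmetic. Assuming an illustrative 7B-class decoder (32 layers, hidden size 4096, full multi-head attention, fp16 storage; these numbers are my assumptions, not the post's), a 100k-token KV cache dwarfs a LoRA adapter by three orders of magnitude:

```python
# Back-of-envelope memory comparison: long-context KV cache vs. a LoRA
# adapter. Model dimensions below are illustrative assumptions only.
n_layers = 32        # transformer blocks
d_model = 4096       # hidden size
bytes_fp16 = 2       # bytes per value in half precision

# KV cache: one K vector and one V vector of size d_model per layer per token.
kv_bytes_per_token = 2 * n_layers * d_model * bytes_fp16

def kv_cache_gib(n_tokens: int) -> float:
    """KV-cache memory for a context of n_tokens, in GiB."""
    return n_tokens * kv_bytes_per_token / 2**30

# LoRA adapter: low-rank factors A (r x d) and B (d x r) on two target
# matrices (query and value projections) in every layer.
rank = 16
adapter_params = n_layers * 2 * (2 * rank * d_model)
adapter_mib = adapter_params * bytes_fp16 / 2**20

print(f"KV cache, 100k-token context: {kv_cache_gib(100_000):.1f} GiB")
print(f"LoRA adapter (rank {rank}):   {adapter_mib:.1f} MiB")
```

Under these assumptions the cache costs about half a megabyte per token (roughly 49 GiB at 100k tokens), while the adapter fits in about 16 MiB, which is the compression gap the post gestures at. Grouped-query attention and cache quantization shrink the cache, but not by enough to close a gap that grows linearly with context length.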