Hasty Briefs

Context Is Software, Weights Are Hardware

3 days ago
  • #LLMs
  • #transformer-architecture
  • #continual-learning
  • Increasing context-window length and improving KV-cache compression form a popular approach to continual learning in LLMs, relying on in-context learning rather than weight updates.
  • Context (via the KV cache) and weights both shape activations in transformers, serving similar functions: in-context learning causes temporary shifts, while fine-tuning leads to permanent changes in internal representations.
  • Weights act like hardware, defining computational capabilities, while context functions as software running on that hardware; weight modification adds new "instructions," enabling computations beyond the pretrained model's original scope.
  • Long context works well for tasks within the pretraining distribution, but it hits a ceiling when a task requires representations the model never learned, such as domain-specific knowledge or novel patterns; that is where weight updates excel.
  • Weight modification offers advantages in inference cost (O(1) vs. O(n) for context), compression (small adapters vs. large token sequences), and composability (cumulative updates vs. single-step approximations).
  • The brain's memory systems (hippocampus for fast, temporary memory and neocortex for slow, persistent storage) provide a biological analogy, suggesting complementary roles for context and weight-based learning.
  • Future development should integrate both methods: longer context for working memory and weight-space learning for accumulating persistent, generalizable knowledge, as neither alone is sufficient for comprehensive continual learning.
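The context/weights distinction in the second point can be sketched with a toy single-head attention layer (illustrative NumPy only, not code from the post): with the weights held fixed, swapping the KV cache shifts the activations temporarily, while editing the weights changes the computation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy model/head dimension

# Fixed "hardware": projection weights learned at pretraining time.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

def attend(query_tok, kv_cache_toks):
    """Single-head attention: output is a cache-weighted mix of values."""
    q = query_tok @ W_q
    K = kv_cache_toks @ W_k
    V = kv_cache_toks @ W_v
    scores = q @ K.T / np.sqrt(d)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ V

x = rng.standard_normal(d)           # current token
ctx_a = rng.standard_normal((3, d))  # one context ("software" state A)
ctx_b = rng.standard_normal((3, d))  # a different context (state B)

# Same weights, different KV cache -> different activations (temporary shift).
out_a, out_b = attend(x, ctx_a), attend(x, ctx_b)
assert not np.allclose(out_a, out_b)

# Same cache, updated weights -> the computation itself changes (persistent shift).
W_v = W_v + 0.5 * rng.standard_normal((d, d))  # crude stand-in for fine-tuning
out_a_after = attend(x, ctx_a)
assert not np.allclose(out_a, out_a_after)
```

Both paths move the activations, which is the summary's point that context and weights serve similar functions; the difference is that the cache swap vanishes when the context is cleared, while the weight update persists across all future contexts.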
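The inference-cost and compression claims can be made concrete with back-of-envelope arithmetic. The configuration below (layer count, heads, fp16, LoRA rank) is an assumption at roughly 7B-model scale, not numbers from the post:

```python
# Hypothetical Llama-7B-like config (assumed for illustration).
n_layers, n_kv_heads, head_dim, d_model = 32, 32, 128, 4096
bytes_per_val = 2  # fp16

def kv_cache_bytes(n_tokens):
    """Context cost grows O(n): K and V stored per token, per layer."""
    return n_tokens * n_layers * n_kv_heads * head_dim * 2 * bytes_per_val

def lora_adapter_bytes(rank=16, n_adapted_mats=4):
    """Adapter cost is O(1) in context length: two rank-r factors
    per adapted d_model x d_model matrix, per layer."""
    return n_layers * n_adapted_mats * 2 * d_model * rank * bytes_per_val

print(f"KV cache @ 100k tokens: {kv_cache_bytes(100_000) / 2**30:.1f} GiB")
print(f"LoRA adapter (r=16):    {lora_adapter_bytes() / 2**20:.1f} MiB")
```

Under these assumptions a 100k-token cache runs to tens of GiB and must be reprocessed or held resident for every request, while a rank-16 adapter is tens of MiB once, which is the compression and cost gap the bullet describes.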