
Why Stacking Sliding Windows Can't See Far

14 days ago
  • #transformer-models
  • #information-flow
  • #sliding-window-attention
  • Sliding Window Attention (SWA) restricts each word to attending only to the last W words, reducing attention cost from quadratic to linear in sequence length.
  • The theoretical receptive field grows as L × W (number of layers × window size), but practical models struggle beyond roughly 1,500 words due to information dilution and residual connections.
  • Without residual connections, the effective receptive field grows only as O(W√L), limited by Gaussian spreading of information across layers (see the first sketch after this list).
  • With residual connections, influence decays exponentially with distance, creating a fixed effective horizon (~1.5×W) that is independent of depth (see the worked example after this list).
  • Residual connections (with a skip weight of α ≈ 0.95, leaving 1 − α for the attention branch) create a trade-off: stable training vs. long-range information access.
  • Hybrid architectures (local + global attention) may overcome these limitations for long-context models.
  • Key formula: Effective horizon D_eff ≈ W × ln(ε)/ln(1-α), where ε is the threshold for negligible influence.
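
The O(W√L) scaling comes from Gaussian spreading. As a rough illustration of that mechanism (a toy uniform-attention model, not the article's derivation), the sketch below composes L window-sized influence kernels by convolution: by the central limit theorem the result is approximately Gaussian, and its width grows only like W·√L even though its nominal support grows like L·W. The window size W = 64 and the uniform kernel are assumptions chosen for the demo.

```python
import numpy as np

# Toy model of "Gaussian spreading" (assumed uniform attention over the window,
# no residuals): each layer spreads a token's influence over W+1 offsets, so the
# influence distribution after L layers is an L-fold convolution of that kernel.
W = 64  # assumed window size for the demo

def influence_width(L: int, W: int) -> float:
    """Std of the composed influence distribution after L stacked window layers."""
    kernel = np.full(W + 1, 1.0 / (W + 1))   # uniform attention over offsets 0..W
    dist = np.array([1.0])                   # delta at offset 0 before any layer
    for _ in range(L):
        dist = np.convolve(dist, kernel)     # one more layer of spreading
    offsets = np.arange(dist.size)
    mean = (offsets * dist).sum()
    return float(np.sqrt(((offsets - mean) ** 2 * dist).sum()))

for L in (1, 4, 16, 64):
    width = influence_width(L, W)
    # Support (nominal reach) grows as L*W, but the width only as ~0.29 * W * sqrt(L).
    print(f"L={L:3d}  support={L * W:5d}  width={width:7.1f}  "
          f"width/(W*sqrt(L))={width / (W * np.sqrt(L)):.3f}")
```

The last column stays constant while the support column grows linearly, which is the √L-versus-L gap the bullet refers to.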
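The key formula follows from a simple argument: to travel a distance D, information must cross roughly D/W window boundaries, and each crossing must pass through the attention branch, which is weighted (1 − α) against the residual's α. The surviving influence is therefore about (1 − α)^(D/W); setting that equal to ε and solving for D gives D_eff. A minimal sketch of the calculation, assuming W = 512 and ε = 0.01 (both example values, not numbers from the article):

```python
import math

def effective_horizon(W: int, alpha: float, eps: float) -> float:
    """Distance D at which the surviving influence (1 - alpha)**(D / W) drops to eps."""
    # Solve (1 - alpha)**(D / W) = eps for D:
    #   (D / W) * ln(1 - alpha) = ln(eps)  =>  D = W * ln(eps) / ln(1 - alpha)
    return W * math.log(eps) / math.log(1.0 - alpha)

W, alpha, eps = 512, 0.95, 0.01   # assumed example values
D_eff = effective_horizon(W, alpha, eps)
print(f"D_eff = {D_eff:.0f} words  (= {D_eff / W:.2f} x W)")  # ~1.54 x W

# Depth never enters: every extra window-hop costs another factor of (1 - alpha),
# so stacking more layers does not push the horizon past ~1.5 x W when alpha = 0.95.
```

With α ≈ 0.95 and a 1% influence threshold this works out to about 1.5×W, matching the horizon quoted above, and the layer count L never appears in the expression.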