Why Stacking Sliding Windows Can't See Far
14 days ago
- #transformer-models
- #information-flow
- #sliding-window-attention
- Sliding Window Attention (SWA) restricts each word to attending only to the previous W words, improving efficiency.
- The theoretical receptive field grows as L × W (layers × window size), but practical models struggle beyond ~1,500 words due to information dilution and residual connections.
- Without residual connections, the effective receptive field grows only as O(W√L), limited by Gaussian spreading of information (the two growth rates are compared in the first sketch after this list).
- With residual connections, influence decays exponentially with distance, creating a fixed effective horizon (~1.5×W) that is independent of depth.
- Residual connections with a high residual weight (α ≈ 0.95) create a trade-off: stable training vs. long-range information access.
- Hybrid architectures (local + global attention) may overcome these limitations for long-context models.
- Key formula: effective horizon D_eff ≈ W × ln(ε)/ln(1-α), where ε is the threshold below which influence is considered negligible (worked through in the second sketch below).
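
As a rough illustration of the scaling claims above, here is a minimal sketch comparing the theoretical L × W receptive field with the diffusion-limited W√L estimate. The window size W, the layer count L, and the O(1) constant in front of W√L are illustrative assumptions, not values from the post.

```python
import math

def theoretical_receptive_field(num_layers: int, window: int) -> int:
    """Upper bound: information can move at most one window per layer, so L * W."""
    return num_layers * window

def diffusion_limited_field(num_layers: int, window: int, c: float = 1.0) -> float:
    """Estimate without residual connections: information spreads in a
    random-walk-like (Gaussian) fashion, so the horizon grows as W * sqrt(L).
    The constant c depends on the attention pattern; c = 1 is an assumption."""
    return c * window * math.sqrt(num_layers)

W, L = 512, 24  # illustrative window size and depth
print(f"theoretical L x W:     {theoretical_receptive_field(L, W)} words")    # 12288
print(f"diffusion-limited:     {diffusion_limited_field(L, W):.0f} words")    # ~2508
```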
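
And a second sketch that works through the key formula D_eff ≈ W × ln(ε)/ln(1-α) using the α ≈ 0.95 quoted above; the window size and the ε = 0.01 threshold are assumed for illustration.

```python
import math

def effective_horizon(window: int, alpha: float, eps: float = 0.01) -> float:
    """With residual weight alpha, each attention hop mixes in only a factor
    (1 - alpha) of new information, so influence k hops away scales like
    (1 - alpha)**k. Solving (1 - alpha)**k = eps gives the number of usable
    hops; multiplying by the window size gives the effective horizon."""
    hops = math.log(eps) / math.log(1.0 - alpha)
    return window * hops

W, alpha = 512, 0.95  # illustrative window size; alpha as quoted in the post
print(effective_horizon(W, alpha))  # ≈ 787 words, i.e. ~1.5 × W
# Depth never appears in the formula, so stacking more layers does not extend the horizon.
```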