Why Stacking Sliding Windows Can't See Far
14 days ago
- #transformer-models
- #information-flow
- #sliding-window-attention
- Sliding Window Attention (SWA) restricts each word to attending only to the previous W words, improving efficiency.
- The theoretical receptive field grows as L × W (layers × window size), but practical models struggle beyond ~1,500 words due to information dilution and residual connections.
- Without residual connections, the effective receptive field grows only as O(W√L), limited by Gaussian spreading of information (the two growth rates are compared in the first sketch after this list).
- With residual connections, influence decays exponentially with distance, creating a fixed effective horizon (~1.5×W) that is independent of depth.
- Residual connections with a high residual weight (α ≈ 0.95) create a trade-off: stable training vs. long-range information access.
- Hybrid architectures (local + global attention) may overcome these limitations for long-context models.
- Key formula: effective horizon D_eff ≈ W × ln(ε)/ln(1-α), where ε is the threshold below which influence is considered negligible (worked through in the second sketch below).
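
As a rough illustration of the scaling claims above, here is a minimal sketch comparing the theoretical L × W receptive field with the diffusion-limited W√L estimate. The window size W, the layer count L, and the O(1) constant in front of W√L are illustrative assumptions, not values from the post.

```python
import math

def theoretical_receptive_field(num_layers: int, window: int) -> int:
    """Upper bound: information can move at most one window per layer, so L * W."""
    return num_layers * window

def diffusion_limited_field(num_layers: int, window: int, c: float = 1.0) -> float:
    """Estimate without residual connections: information spreads in a
    random-walk-like (Gaussian) fashion, so the horizon grows as W * sqrt(L).
    The constant c depends on the attention pattern; c = 1 is an assumption."""
    return c * window * math.sqrt(num_layers)

W, L = 512, 24  # illustrative window size and depth
print(f"theoretical L x W:     {theoretical_receptive_field(L, W)} words")    # 12288
print(f"diffusion-limited:     {diffusion_limited_field(L, W):.0f} words")    # ~2508
```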
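
And a second sketch that works through the key formula D_eff ≈ W × ln(ε)/ln(1-α) using the α ≈ 0.95 quoted above; the window size and the ε = 0.01 threshold are assumed for illustration.

```python
import math

def effective_horizon(window: int, alpha: float, eps: float = 0.01) -> float:
    """With residual weight alpha, each attention hop mixes in only a factor
    (1 - alpha) of new information, so influence k hops away scales like
    (1 - alpha)**k. Solving (1 - alpha)**k = eps gives the number of usable
    hops; multiplying by the window size gives the effective horizon."""
    hops = math.log(eps) / math.log(1.0 - alpha)
    return window * hops

W, alpha = 512, 0.95  # illustrative window size; alpha as quoted in the post
print(effective_horizon(W, alpha))  # ≈ 787 words, i.e. ~1.5 × W
# Depth never appears in the formula, so stacking more layers does not extend the horizon.
```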