Breaking the quadratic wall of Transformer attention: WERSA, with paper and code open-sourced
9 months ago
- #Machine-Learning
- #Attention-Mechanism
- #Transformer
- WERSA is a novel attention mechanism with linear O(n) time complexity, designed for scaling Transformer models to very long sequences.
- It combines multi-resolution analysis (Haar wavelet transforms), adaptive filtering (MLP-generated filters), and random feature projection to achieve linear complexity (a simplified sketch of these ingredients follows this list).
- Installation requires PyTorch, Hugging Face Transformers, and the WERSA package from the repository.
- Quickstart examples show how to build Qwen-like models with WERSA, including 8B and 0.6B parameter configurations (see the hypothetical quickstart sketch after this list).
- The repository includes training scripts for pre-training models from scratch and for testing generation (see the training outline after this list).
- The project is licensed under Apache License 2.0 and the paper is available on arXiv.
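
To make the second bullet concrete, here is a minimal, self-contained sketch of the three ingredients it names: a Haar multi-resolution decomposition along the sequence axis, an MLP that produces per-band filter gates, and a positive random-feature map that turns attention into an O(n) computation. It illustrates the general recipe, not the paper's actual WERSA layer; how the pieces are wired together here (filtering the keys, a single head, non-causal attention) is an assumption made for illustration.

```python
# Simplified sketch of the ingredients named above. NOT the authors' implementation;
# module names and the way the pieces are combined are illustrative assumptions.
import math
import torch
import torch.nn as nn


def haar_decompose(x, levels):
    """Haar analysis along the sequence dimension.
    x: (batch, seq_len, dim), seq_len divisible by 2**levels.
    Returns [detail_1, ..., detail_L, approx_L]."""
    coeffs, approx = [], x
    for _ in range(levels):
        even, odd = approx[:, 0::2], approx[:, 1::2]
        coeffs.append((even - odd) / math.sqrt(2.0))   # detail band
        approx = (even + odd) / math.sqrt(2.0)         # coarser approximation
    coeffs.append(approx)
    return coeffs


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    approx = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        even = (approx + detail) / math.sqrt(2.0)
        odd = (approx - detail) / math.sqrt(2.0)
        # Interleave even/odd positions back into one sequence.
        approx = torch.stack((even, odd), dim=2).flatten(1, 2)
    return approx


class LinearRandomFeatureAttention(nn.Module):
    """O(n) attention via positive random features (Performer-style; constants omitted)."""

    def __init__(self, dim, num_features=64):
        super().__init__()
        self.register_buffer("omega", torch.randn(dim, num_features))

    def feature_map(self, x):
        # phi(x) = exp(x @ omega - |x|^2 / 2): positive random features for the softmax kernel.
        return torch.exp(x @ self.omega - x.pow(2).sum(-1, keepdim=True) / 2)

    def forward(self, q, k, v):
        q_f, k_f = self.feature_map(q), self.feature_map(k)      # (b, n, m)
        kv = torch.einsum("bnm,bnd->bmd", k_f, v)                # (b, m, d), built in O(n)
        normalizer = q_f @ k_f.sum(dim=1).unsqueeze(-1)          # (b, n, 1)
        return torch.einsum("bnm,bmd->bnd", q_f, kv) / (normalizer + 1e-6)


class WersaLikeAttention(nn.Module):
    """Illustrative combination: wavelet bands -> MLP-generated gates -> linear attention."""

    def __init__(self, dim, levels=2, num_features=64):
        super().__init__()
        self.levels = levels
        self.qkv = nn.Linear(dim, 3 * dim)
        # Small MLP that produces one filter gate per wavelet band from a pooled summary.
        self.filter_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, levels + 1)
        )
        self.attn = LinearRandomFeatureAttention(dim, num_features)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        gates = torch.sigmoid(self.filter_mlp(x.mean(dim=1)))    # (b, levels + 1)
        # Adaptively filter each resolution band of the keys, then resynthesize.
        bands = haar_decompose(k, self.levels)
        filtered = [g.view(-1, 1, 1) * band for g, band in zip(gates.unbind(-1), bands)]
        k = haar_reconstruct(filtered)
        return self.out(self.attn(q, k, v))


# Usage: seq_len must be divisible by 2**levels for this toy Haar transform.
# attn = WersaLikeAttention(dim=256, levels=2)
# y = attn(torch.randn(2, 128, 256))   # (2, 128, 256), computed in linear time
```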
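For the installation and quickstart bullets, the sketch below shows what a Hugging Face-style setup could look like. The import path `wersa`, the classes `WersaConfig` / `WersaForCausalLM`, and every configuration value are hypothetical placeholders, not the repository's documented API; the README documents the real class names and the exact 8B and 0.6B configurations.

```python
# Hypothetical quickstart, assuming the package exposes a Hugging Face-style
# config/model pair. All names and values below are illustrative assumptions.
#
# Prerequisites (roughly): pip install torch transformers, then install the
# WERSA package from the repository (e.g. pip install -e . inside a clone).

from wersa import WersaConfig, WersaForCausalLM   # hypothetical import path

# A small, Qwen-like "0.6B-class" configuration; all values are placeholders.
config = WersaConfig(
    vocab_size=151_936,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    max_position_embeddings=32_768,   # long context is the point of O(n) attention
)

model = WersaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```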
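Finally, a rough outline of what pre-training from scratch and a generation smoke test typically look like with the standard Hugging Face `Trainer` API, assuming the hypothetical model object from the quickstart sketch above; the repository's own training scripts are the authoritative reference, and the tokenizer and dataset names here are placeholders.

```python
# Generic pre-training outline with the Hugging Face Trainer, followed by a
# generation smoke test. `model` is assumed to come from the quickstart sketch.
from transformers import (AutoTokenizer, Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")   # placeholder tokenizer

# Placeholder corpus; a real pre-training run would use a much larger dataset.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,                                   # WERSA-backed causal LM (hypothetical)
    args=TrainingArguments(
        output_dir="wersa-pretrain",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Generation smoke test after training.
prompt = tokenizer("The wavelet transform", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```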