Breaking the quadratic wall of Transformer attention: WERSA, with paper and code open-sourced
9 months ago
- #Machine-Learning
- #Attention-Mechanism
- #Transformer
- WERSA is a novel attention mechanism with linear O(n) time complexity, designed for scaling Transformer models to very long sequences.
- It combines multi-resolution analysis (Haar wavelet transforms), adaptive filtering (MLP-generated filters), and random feature projection to achieve linear complexity (a simplified sketch of these ingredients follows this list).
- Installation requires PyTorch, Hugging Face Transformers, and the WERSA package from the repository.
- Quickstart examples show how to build Qwen-like models with WERSA, including 8B and 0.6B parameter configurations (see the hypothetical quickstart sketch after this list).
- The repository includes training scripts for pre-training models from scratch and for testing generation (see the training outline after this list).
- The project is licensed under Apache License 2.0 and the paper is available on arXiv.
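
To make the second bullet concrete, here is a minimal, self-contained sketch of the three ingredients it names: a Haar multi-resolution decomposition along the sequence axis, an MLP that produces per-band filter gates, and a positive random-feature map that turns attention into an O(n) computation. It illustrates the general recipe, not the paper's actual WERSA layer; how the pieces are wired together here (filtering the keys, a single head, non-causal attention) is an assumption made for illustration.

```python
# Simplified sketch of the ingredients named above. NOT the authors' implementation;
# module names and the way the pieces are combined are illustrative assumptions.
import math
import torch
import torch.nn as nn


def haar_decompose(x, levels):
    """Haar analysis along the sequence dimension.
    x: (batch, seq_len, dim), seq_len divisible by 2**levels.
    Returns [detail_1, ..., detail_L, approx_L]."""
    coeffs, approx = [], x
    for _ in range(levels):
        even, odd = approx[:, 0::2], approx[:, 1::2]
        coeffs.append((even - odd) / math.sqrt(2.0))   # detail band
        approx = (even + odd) / math.sqrt(2.0)         # coarser approximation
    coeffs.append(approx)
    return coeffs


def haar_reconstruct(coeffs):
    """Inverse of haar_decompose."""
    approx = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        even = (approx + detail) / math.sqrt(2.0)
        odd = (approx - detail) / math.sqrt(2.0)
        # Interleave even/odd positions back into one sequence.
        approx = torch.stack((even, odd), dim=2).flatten(1, 2)
    return approx


class LinearRandomFeatureAttention(nn.Module):
    """O(n) attention via positive random features (Performer-style; constants omitted)."""

    def __init__(self, dim, num_features=64):
        super().__init__()
        self.register_buffer("omega", torch.randn(dim, num_features))

    def feature_map(self, x):
        # phi(x) = exp(x @ omega - |x|^2 / 2): positive random features for the softmax kernel.
        return torch.exp(x @ self.omega - x.pow(2).sum(-1, keepdim=True) / 2)

    def forward(self, q, k, v):
        q_f, k_f = self.feature_map(q), self.feature_map(k)      # (b, n, m)
        kv = torch.einsum("bnm,bnd->bmd", k_f, v)                # (b, m, d), built in O(n)
        normalizer = q_f @ k_f.sum(dim=1).unsqueeze(-1)          # (b, n, 1)
        return torch.einsum("bnm,bmd->bnd", q_f, kv) / (normalizer + 1e-6)


class WersaLikeAttention(nn.Module):
    """Illustrative combination: wavelet bands -> MLP-generated gates -> linear attention."""

    def __init__(self, dim, levels=2, num_features=64):
        super().__init__()
        self.levels = levels
        self.qkv = nn.Linear(dim, 3 * dim)
        # Small MLP that produces one filter gate per wavelet band from a pooled summary.
        self.filter_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, levels + 1)
        )
        self.attn = LinearRandomFeatureAttention(dim, num_features)
        self.out = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        gates = torch.sigmoid(self.filter_mlp(x.mean(dim=1)))    # (b, levels + 1)
        # Adaptively filter each resolution band of the keys, then resynthesize.
        bands = haar_decompose(k, self.levels)
        filtered = [g.view(-1, 1, 1) * band for g, band in zip(gates.unbind(-1), bands)]
        k = haar_reconstruct(filtered)
        return self.out(self.attn(q, k, v))


# Usage: seq_len must be divisible by 2**levels for this toy Haar transform.
# attn = WersaLikeAttention(dim=256, levels=2)
# y = attn(torch.randn(2, 128, 256))   # (2, 128, 256), computed in linear time
```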
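For the installation and quickstart bullets, the sketch below shows what a Hugging Face-style setup could look like. The import path `wersa`, the classes `WersaConfig` / `WersaForCausalLM`, and every configuration value are hypothetical placeholders, not the repository's documented API; the README documents the real class names and the exact 8B and 0.6B configurations.

```python
# Hypothetical quickstart, assuming the package exposes a Hugging Face-style
# config/model pair. All names and values below are illustrative assumptions.
#
# Prerequisites (roughly): pip install torch transformers, then install the
# WERSA package from the repository (e.g. pip install -e . inside a clone).

from wersa import WersaConfig, WersaForCausalLM   # hypothetical import path

# A small, Qwen-like "0.6B-class" configuration; all values are placeholders.
config = WersaConfig(
    vocab_size=151_936,
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    max_position_embeddings=32_768,   # long context is the point of O(n) attention
)

model = WersaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters")
```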
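Finally, a rough outline of what pre-training from scratch and a generation smoke test typically look like with the standard Hugging Face `Trainer` API, assuming the hypothetical model object from the quickstart sketch above; the repository's own training scripts are the authoritative reference, and the tokenizer and dataset names here are placeholders.

```python
# Generic pre-training outline with the Hugging Face Trainer, followed by a
# generation smoke test. `model` is assumed to come from the quickstart sketch.
from transformers import (AutoTokenizer, Trainer, TrainingArguments,
                          DataCollatorForLanguageModeling)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")   # placeholder tokenizer

# Placeholder corpus; a real pre-training run would use a much larger dataset.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
tokenized = raw.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=raw.column_names,
)

trainer = Trainer(
    model=model,                                   # WERSA-backed causal LM (hypothetical)
    args=TrainingArguments(
        output_dir="wersa-pretrain",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Generation smoke test after training.
prompt = tokenizer("The wavelet transform", return_tensors="pt")
out = model.generate(**prompt, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```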