Break the quadratic wall of Transformer attention: WERSA, with paper and code open-sourced

9 months ago
  • #Machine-Learning
  • #Attention-Mechanism
  • #Transformer
  • WERSA is a novel attention mechanism with linear O(n) time complexity, designed to scale Transformer models to very long sequences (a rough cost comparison against standard quadratic attention follows this list).
  • It combines multi-resolution analysis (Haar wavelet transforms), adaptive filtering (MLP-generated filters), and random-feature projection to achieve linear complexity (see the sketch after this list).
  • Installation requires PyTorch, Hugging Face Transformers, and the WERSA package from the repository.
  • Quickstart examples show how to build Qwen-like models with WERSA, including 8B and 0.6B parameter configurations (a back-of-the-envelope parameter count appears after this list).
  • The repository includes training scripts for pre-training models from scratch and for testing generation.
  • The project is licensed under Apache License 2.0 and the paper is available on arXiv.
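
To make the "quadratic wall" concrete: standard softmax attention materializes an n-by-n score matrix, while kernelized linear attention contracts keys with values first, so its cost grows linearly in sequence length. The PyTorch snippet below illustrates that general difference only; it is not WERSA itself, and the feature map and sizes are assumptions.

import time
import torch
import torch.nn.functional as F

def softmax_attention(q, k, v):
    # Materializes an (n x n) score matrix: time and memory grow quadratically in n.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Kernel trick with an elu(x) + 1 feature map: keys and values are contracted
    # first into a (dim x dim) summary, so time and memory grow linearly in n.
    q_f, k_f = F.elu(q) + 1, F.elu(k) + 1
    kv = k_f.transpose(-2, -1) @ v
    norm = q_f @ k_f.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (q_f @ kv) / norm.clamp(min=1e-6)

for n in (1_024, 2_048, 4_096, 8_192):
    q, k, v = (torch.randn(1, n, 64) for _ in range(3))
    for name, fn in (("softmax", softmax_attention), ("linear", linear_attention)):
        start = time.perf_counter()
        fn(q, k, v)
        print(f"n={n:5d}  {name:7s}  {time.perf_counter() - start:.3f}s")

Doubling n roughly quadruples the softmax timings but only doubles the linear ones, which is the scaling behavior the O(n) claim refers to.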
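
The three ingredients in the second bullet can be wired together roughly as follows. This is a minimal sketch of the general recipe under assumed shapes and hyperparameters, not the official WERSA implementation; every class, function, and parameter name here is invented for illustration.

import torch
import torch.nn as nn

def haar_filter(x, gates):
    """Re-weight each Haar wavelet band of x along the sequence axis, then reconstruct.

    x:     (batch, seq, dim); seq must be divisible by 2 ** levels
    gates: (batch, levels + 1), one gate per detail band plus one for the approximation
    """
    levels = gates.size(-1) - 1
    details, approx = [], x
    for _ in range(levels):                              # forward Haar transform
        even, odd = approx[:, 0::2], approx[:, 1::2]
        details.append((even - odd) / 2 ** 0.5)          # high-frequency detail band
        approx = (even + odd) / 2 ** 0.5                 # low-frequency approximation
    approx = approx * gates[:, -1].view(-1, 1, 1)
    for level in reversed(range(levels)):                # inverse transform on gated bands
        detail = details[level] * gates[:, level].view(-1, 1, 1)
        even, odd = (approx + detail) / 2 ** 0.5, (approx - detail) / 2 ** 0.5
        approx = torch.stack((even, odd), dim=2).flatten(1, 2)   # re-interleave positions
    return approx

class ToyWaveletRandomFeatureAttention(nn.Module):
    """Illustrative only: Haar multi-resolution filtering + random-feature linear attention."""

    def __init__(self, dim, num_features=64, levels=3):
        super().__init__()
        self.q_proj, self.k_proj, self.v_proj = (nn.Linear(dim, dim) for _ in range(3))
        # MLP that generates one filter gate per wavelet band from pooled context.
        self.filter_mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, levels + 1))
        # Fixed Gaussian projection used for the positive random-feature map.
        self.register_buffer("rand_proj", torch.randn(dim, num_features))
        self.scale = dim ** -0.25

    def _phi(self, x):
        # Performer-style positive random features approximating the softmax kernel.
        x = x * self.scale
        return torch.exp(x @ self.rand_proj - x.pow(2).sum(-1, keepdim=True) / 2)

    def forward(self, x):
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Adaptive filtering: gate each resolution band of the keys.
        gates = torch.sigmoid(self.filter_mlp(x.mean(dim=1)))       # (batch, levels + 1)
        k = haar_filter(k, gates)
        # Linear-time attention: contract keys with values before touching queries.
        q_f, k_f = self._phi(q), self._phi(k)
        kv = torch.einsum("bnf,bnd->bfd", k_f, v)
        norm = q_f @ k_f.sum(dim=1).unsqueeze(-1)                    # (batch, seq, 1)
        return torch.einsum("bnf,bfd->bnd", q_f, kv) / norm.clamp(min=1e-6)

attn = ToyWaveletRandomFeatureAttention(dim=64)
out = attn(torch.randn(2, 256, 64))          # 256 is divisible by 2 ** levels
print(out.shape)                             # torch.Size([2, 256, 64])

Gating the wavelet bands of the keys before the random-feature contraction is one plausible way to combine the pieces; the paper's actual placement of the filters may differ.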
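
On the quoted model sizes: the exact quickstart configurations are not reproduced here, but the back-of-the-envelope count below shows how a Qwen-style decoder lands at roughly the 0.6B scale. All dimensions are assumed for illustration, not taken from the repository.

vocab, d, ffn, layers = 151_936, 1024, 3072, 28      # assumed Qwen-style dimensions
heads, kv_heads, head_dim = 16, 8, 128

attn = (d * heads * head_dim                 # query projection
        + 2 * d * kv_heads * head_dim        # key and value projections (grouped-query)
        + heads * head_dim * d)              # output projection
mlp = 3 * d * ffn                            # gate, up and down projections
norms = 2 * d                                # two RMSNorm weights per layer
embeddings = vocab * d                       # tied input/output embeddings

total = embeddings + layers * (attn + mlp + norms) + d
print(f"{total / 1e9:.2f}B parameters")      # ~0.60B with these assumed values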