Analog in-memory computing attention mechanism for fast, energy-efficient LLMs
- #Energy Efficiency
- #Transformer Networks
- #In-Memory Computing
- Transformer networks rely on self-attention mechanisms, which are crucial for large language models (LLMs).
- Current GPU-based systems face latency and energy bottlenecks because cached key and value (KV) token projections must be repeatedly loaded from memory into SRAM at every generation step.
- A novel in-memory computing architecture using gain cells is proposed to store token projections and perform analog dot-product computations for self-attention (a minimal sketch of this dot-product step follows the list).
- Gain cells offer advantages like fast writes, high endurance, and multi-level storage, making them suitable for dynamic KV cache updates.
- The architecture avoids power-intensive ADCs by using charge-to-pulse circuits for analog signal processing.
- A hardware-aware adaptation algorithm maps pre-trained models (e.g., GPT-2) onto the non-ideal gain-cell-based hardware without retraining from scratch (see the adaptation sketch below).
- The design reduces attention latency by up to two orders of magnitude and energy consumption by up to four orders of magnitude compared with GPUs.
- Sliding window attention is implemented so that the stored KV cache stays at a fixed size rather than growing with sequence length, while maintaining performance (see the sliding-window sketch below).
- The architecture supports 3D integration for scalability, with area estimates showing compact footprints for multi-head attention.
- Benchmarks show accuracy comparable to GPT-2 on NLP tasks, validating the feasibility of analog in-memory computing for LLMs.
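
To make the analog dot-product idea concrete, here is a minimal NumPy sketch of how attention scores might be evaluated by a gain-cell array: key and value projections are written into the array as a few quantized analog levels, and each query is applied so that the accumulated charge per column encodes a dot product. The bit depth, noise levels, and use of a standard softmax here are illustrative assumptions, not values or circuits from the paper (which replaces ADCs with charge-to-pulse circuitry not modeled here).

```python
import numpy as np

rng = np.random.default_rng(0)

def write_to_gain_cells(x, levels=8, noise=0.02):
    """Quantize a tensor to a few analog levels and add write noise,
    mimicking multi-level storage in a gain-cell array (assumed parameters)."""
    lo, hi = x.min(), x.max()
    step = (hi - lo) / (levels - 1)
    q = np.round((x - lo) / step) * step + lo
    return q + rng.normal(0.0, noise * step, size=x.shape)

def analog_dot_products(v, M_stored, noise=0.01):
    """Dot product of an input vector with every row stored in the array.
    In hardware this is one parallel charge accumulation per stored row;
    here it is an ordinary matrix-vector product plus read noise."""
    s = M_stored @ v
    return s + rng.normal(0.0, noise * np.abs(s).max(), size=s.shape)

d, T = 64, 16                                          # head dimension, tokens cached so far
K = write_to_gain_cells(rng.standard_normal((T, d)))   # cached key projections
V = write_to_gain_cells(rng.standard_normal((T, d)))   # cached value projections
q = rng.standard_normal(d)                             # current query

scores = analog_dot_products(q, K) / np.sqrt(d)        # first analog stage: q . k_i
weights = np.exp(scores - scores.max())
weights /= weights.sum()
out = analog_dot_products(weights, V.T)                # second analog stage: weighted sum of values
print(out.shape)                                       # (64,)
```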
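For the hardware-aware adaptation bullet, one common way to fine-tune a pre-trained model against non-ideal analog arithmetic is to put a differentiable hardware model in the forward pass and use a straight-through estimator in the backward pass, so existing GPT-2 weights can be adapted with a short fine-tuning run instead of training from scratch. The PyTorch sketch below illustrates that pattern under stated assumptions; the `nonideality` transfer curve is a placeholder, and the paper's exact adaptation procedure may differ.

```python
import torch

def nonideality(s):
    # Placeholder circuit model: mild saturation plus read noise (assumed, not measured).
    return torch.tanh(0.5 * s) / 0.5 + 0.01 * torch.randn_like(s)

class AnalogDotProduct(torch.autograd.Function):
    """Attention score computation routed through a non-ideal analog model."""

    @staticmethod
    def forward(ctx, q, k):
        ctx.save_for_backward(q, k)
        return nonideality(q @ k.transpose(-1, -2))

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through estimator: gradients flow as if the MAC were ideal,
        # which lets ordinary SGD adapt the pre-trained projection weights.
        q, k = ctx.saved_tensors
        grad_q = grad_out @ k
        grad_k = grad_out.transpose(-1, -2) @ q
        return grad_q, grad_k

# Usage: gradients exist, so a standard fine-tuning loop can run on top of this op.
q = torch.randn(1, 16, 64, requires_grad=True)
k = torch.randn(1, 16, 64, requires_grad=True)
scores = AnalogDotProduct.apply(q, k)
scores.sum().backward()
print(q.grad.shape, k.grad.shape)   # torch.Size([1, 16, 64]) torch.Size([1, 16, 64])
```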
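Finally, the sliding-window bullet: restricting each token to attend only to the most recent `window` tokens keeps the number of stored K/V projections, and hence the gain-cell array size, constant as the sequence grows. The sketch below shows the standard causal sliding-window mask; the window size is an arbitrary example, not the paper's configuration.

```python
import numpy as np

def sliding_window_attention(Q, K, V, window=4):
    """Causal self-attention where token i attends only to tokens (i-window, i]."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    idx = np.arange(T)
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((10, 64)) for _ in range(3))
out = sliding_window_attention(Q, K, V, window=4)
print(out.shape)   # (10, 64) -- memory per step bounded by the window, not by sequence length
```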