Hasty Briefs

Analog in-memory computing attention mechanism for fast energy-efficient LLMs

  • #Energy Efficiency
  • #Transformer Networks
  • #In-Memory Computing
  • Transformer networks rely on self-attention mechanisms, which are crucial for large language models (LLMs).
  • Current GPU-based systems hit latency and energy bottlenecks because cached token projections (the KV cache) must be repeatedly loaded into SRAM at each generation step.
  • A novel in-memory computing architecture based on gain cells is proposed: it stores the token projections and performs the analog dot-product computations of self-attention directly in memory (a minimal sketch of this follows the list).
  • Gain cells offer advantages like fast writes, high endurance, and multi-level storage, making them suitable for dynamic KV cache updates.
  • The architecture avoids power-hungry analog-to-digital converters (ADCs) by using charge-to-pulse circuits, keeping the signal path analog (see the pulse-encoding sketch below).
  • A hardware-aware adaptation algorithm maps pre-trained models (e.g., GPT-2) onto the non-ideal gain-cell-based hardware without retraining from scratch (see the noise-aware fine-tuning sketch below).
  • The design reduces attention latency by up to two orders of magnitude and energy consumption by up to four orders of magnitude compared with GPUs.
  • Sliding window attention keeps memory requirements from growing with sequence length while maintaining performance (see the sliding-window mask sketch after the list).
  • The architecture supports 3D integration for scalability, with area estimates showing compact footprints for multi-head attention.
  • Benchmarks show accuracy comparable to GPT-2 on NLP tasks, supporting the feasibility of analog in-memory computing for LLMs.
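
To make the attention bullet concrete, here is a minimal NumPy sketch of dot-product attention with the kinds of non-idealities a gain-cell array would introduce: keys and values are quantized to a few analog levels and the dot products pick up read noise. The softmax is a conventional stand-in for clarity, and all names (`quantize`, `analog_dot_attention`, `noise_std`, `bits`) are illustrative assumptions, not terms from the paper.

```python
import numpy as np

def quantize(x, bits=4):
    """Snap values to a small number of levels, mimicking multi-level gain-cell storage."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    step = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / step) * step + lo

def analog_dot_attention(q, K, V, noise_std=0.02, bits=4):
    """Attention for one query with K/V resident in a simulated gain-cell array.

    Quantization stands in for multi-level cell storage; additive Gaussian
    noise stands in for analog read/compute variability.
    """
    Kq, Vq = quantize(K, bits), quantize(V, bits)
    scores = Kq @ q / np.sqrt(q.shape[-1])                  # analog Q.K dot products
    scores = scores + np.random.normal(0.0, noise_std, scores.shape)
    weights = np.exp(scores - scores.max())                 # softmax stand-in
    weights = weights / weights.sum()
    out = weights @ Vq                                      # analog A.V dot products
    return out + np.random.normal(0.0, noise_std, out.shape)

# Example: one query over a cached window of 128 tokens, head dimension 64.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
q = rng.standard_normal(64)
print(analog_dot_attention(q, K, V).shape)   # -> (64,)
```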
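
The charge-to-pulse bullet can be illustrated with a toy encoder: rather than digitizing an accumulated charge through an ADC, the charge sets how long an output pulse stays high, and that duration gates the next analog stage. The linear charge-to-duration mapping and the constants `q_max`, `t_max`, and `time_step` below are assumptions made purely for illustration.

```python
import numpy as np

def charge_to_pulse(charge, q_max=1.0, t_max=1e-6, time_step=1e-8):
    """Encode an accumulated charge as a pulse duration instead of an ADC code.

    The pulse length (counted in time steps) is proportional to the charge,
    clipped to the full-scale range [0, q_max] -> [0, t_max].
    """
    duration = np.clip(charge, 0.0, q_max) / q_max * t_max
    return int(round(duration / time_step))   # time steps the pulse is held high

# The pulse can gate downstream analog accumulation for that many steps,
# so the intermediate value never has to be converted to a digital number.
for c in (0.0, 0.25, 0.9, 1.4):
    print(f"charge {c:.2f} -> {charge_to_pulse(c)} steps")
```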
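
For the hardware-aware adaptation bullet, a common general recipe is noise-aware fine-tuning: inject the target hardware's non-idealities into the forward pass and fine-tune the pre-trained weights so they become robust to them. The PyTorch sketch below shows only that generic recipe; the paper's specific adaptation algorithm, noise model, and hyperparameters are not reproduced here, and `NoisyLinear` and `wrap_linear_layers` are hypothetical names.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose forward pass mimics analog compute non-idealities."""
    def __init__(self, linear: nn.Linear, noise_std: float = 0.02):
        super().__init__()
        self.linear = linear
        self.noise_std = noise_std

    def forward(self, x):
        y = self.linear(x)
        if self.training:   # inject relative read/compute noise during adaptation
            y = y + torch.randn_like(y) * self.noise_std * y.detach().abs().mean()
        return y

def wrap_linear_layers(module: nn.Module, noise_std: float = 0.02) -> nn.Module:
    """Recursively replace nn.Linear layers with noise-injecting versions."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, NoisyLinear(child, noise_std))
        else:
            wrap_linear_layers(child, noise_std)
    return module

# Usage (illustrative): wrap a pre-trained model, then fine-tune with a small
# learning rate so the existing weights adapt to the injected non-idealities
# instead of being retrained from scratch.
# model = wrap_linear_layers(pretrained_gpt2_model)
```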
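
Finally, sliding-window attention bounds how many cached keys and values each query can see, so per-head KV storage stays constant regardless of sequence length. A minimal causal windowed mask in NumPy (names are illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions i-window+1 .. i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(seq_len=8, window=3).astype(int))
# Each row has at most `window` ones, so the KV cache a head must hold is
# bounded by the window size, not by the full sequence length.
```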