Hasty Briefs

Analog in-memory computing attention mechanism for fast energy-efficient LLMs

  • #Energy Efficiency
  • #Transformer Networks
  • #In-Memory Computing
  • Transformer networks rely on self-attention mechanisms, which are crucial for large language models (LLMs).
  • Current GPU-based systems hit latency and energy bottlenecks because cached token projections (the KV cache) must be repeatedly loaded into SRAM at each generation step.
  • A novel in-memory computing architecture based on gain cells is proposed: it stores the token projections and performs the analog dot-product computations of self-attention directly in memory (a minimal sketch of this follows the list).
  • Gain cells offer advantages like fast writes, high endurance, and multi-level storage, making them suitable for dynamic KV cache updates.
  • The architecture avoids power-hungry analog-to-digital converters (ADCs) by using charge-to-pulse circuits, keeping the signal path analog (see the pulse-encoding sketch below).
  • A hardware-aware adaptation algorithm maps pre-trained models (e.g., GPT-2) onto the non-ideal gain-cell-based hardware without retraining from scratch (see the noise-aware fine-tuning sketch below).
  • The design reduces attention latency by up to two orders of magnitude and energy consumption by up to four orders of magnitude compared with GPUs.
  • Sliding window attention keeps memory requirements from growing with sequence length while maintaining performance (see the sliding-window mask sketch after the list).
  • The architecture supports 3D integration for scalability, with area estimates showing compact footprints for multi-head attention.
  • Benchmarks show accuracy comparable to GPT-2 on NLP tasks, supporting the feasibility of analog in-memory computing for LLMs.
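
To make the attention bullet concrete, here is a minimal NumPy sketch of dot-product attention with the kinds of non-idealities a gain-cell array would introduce: keys and values are quantized to a few analog levels and the dot products pick up read noise. The softmax is a conventional stand-in for clarity, and all names (`quantize`, `analog_dot_attention`, `noise_std`, `bits`) are illustrative assumptions, not terms from the paper.

```python
import numpy as np

def quantize(x, bits=4):
    """Snap values to a small number of levels, mimicking multi-level gain-cell storage."""
    levels = 2 ** bits - 1
    lo, hi = x.min(), x.max()
    step = (hi - lo) / levels if hi > lo else 1.0
    return np.round((x - lo) / step) * step + lo

def analog_dot_attention(q, K, V, noise_std=0.02, bits=4):
    """Attention for one query with K/V resident in a simulated gain-cell array.

    Quantization stands in for multi-level cell storage; additive Gaussian
    noise stands in for analog read/compute variability.
    """
    Kq, Vq = quantize(K, bits), quantize(V, bits)
    scores = Kq @ q / np.sqrt(q.shape[-1])                  # analog Q.K dot products
    scores = scores + np.random.normal(0.0, noise_std, scores.shape)
    weights = np.exp(scores - scores.max())                 # softmax stand-in
    weights = weights / weights.sum()
    out = weights @ Vq                                      # analog A.V dot products
    return out + np.random.normal(0.0, noise_std, out.shape)

# Example: one query over a cached window of 128 tokens, head dimension 64.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((128, 64)), rng.standard_normal((128, 64))
q = rng.standard_normal(64)
print(analog_dot_attention(q, K, V).shape)   # -> (64,)
```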
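
The charge-to-pulse bullet can be illustrated with a toy encoder: rather than digitizing an accumulated charge through an ADC, the charge sets how long an output pulse stays high, and that duration gates the next analog stage. The linear charge-to-duration mapping and the constants `q_max`, `t_max`, and `time_step` below are assumptions made purely for illustration.

```python
import numpy as np

def charge_to_pulse(charge, q_max=1.0, t_max=1e-6, time_step=1e-8):
    """Encode an accumulated charge as a pulse duration instead of an ADC code.

    The pulse length (counted in time steps) is proportional to the charge,
    clipped to the full-scale range [0, q_max] -> [0, t_max].
    """
    duration = np.clip(charge, 0.0, q_max) / q_max * t_max
    return int(round(duration / time_step))   # time steps the pulse is held high

# The pulse can gate downstream analog accumulation for that many steps,
# so the intermediate value never has to be converted to a digital number.
for c in (0.0, 0.25, 0.9, 1.4):
    print(f"charge {c:.2f} -> {charge_to_pulse(c)} steps")
```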
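
For the hardware-aware adaptation bullet, a common general recipe is noise-aware fine-tuning: inject the target hardware's non-idealities into the forward pass and fine-tune the pre-trained weights so they become robust to them. The PyTorch sketch below shows only that generic recipe; the paper's specific adaptation algorithm, noise model, and hyperparameters are not reproduced here, and `NoisyLinear` and `wrap_linear_layers` are hypothetical names.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose forward pass mimics analog compute non-idealities."""
    def __init__(self, linear: nn.Linear, noise_std: float = 0.02):
        super().__init__()
        self.linear = linear
        self.noise_std = noise_std

    def forward(self, x):
        y = self.linear(x)
        if self.training:   # inject relative read/compute noise during adaptation
            y = y + torch.randn_like(y) * self.noise_std * y.detach().abs().mean()
        return y

def wrap_linear_layers(module: nn.Module, noise_std: float = 0.02) -> nn.Module:
    """Recursively replace nn.Linear layers with noise-injecting versions."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, NoisyLinear(child, noise_std))
        else:
            wrap_linear_layers(child, noise_std)
    return module

# Usage (illustrative): wrap a pre-trained model, then fine-tune with a small
# learning rate so the existing weights adapt to the injected non-idealities
# instead of being retrained from scratch.
# model = wrap_linear_layers(pretrained_gpt2_model)
```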
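
Finally, sliding-window attention bounds how many cached keys and values each query can see, so per-head KV storage stays constant regardless of sequence length. A minimal causal windowed mask in NumPy (names are illustrative):

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where position i may attend to positions i-window+1 .. i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(seq_len=8, window=3).astype(int))
# Each row has at most `window` ones, so the KV cache a head must hold is
# bounded by the window size, not by the full sequence length.
```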