The emerging role of SRAM-centric chips in AI inference
- #hardware-optimization
- #SRAM-architecture
- #AI-inference
- SRAM-centric chips (e.g., Cerebras, Groq, d-Matrix) are gaining traction in AI inference due to their latency and throughput advantages over GPUs.
- Key tradeoff between SRAM and HBM (a form of stacked DRAM): SRAM is fast and sits on-chip but is far less dense; HBM is much denser but slower and off-chip.
- SRAM uses 6 transistors per bit, giving fast reads (~1 ns); DRAM uses 1 transistor and 1 capacitor per bit, which makes it denser but slower (~10-15 ns reads) and requires periodic refresh.
- GPU memory hierarchy includes SRAM (registers, shared memory, caches) and DRAM (HBM), balancing speed and density.
- SRAM-centric architectures devote die area to on-chip memory rather than compute, trading raw FLOPs for much higher memory bandwidth, which suits memory-bound workloads.
- Inference, especially the decode phase, benefits from SRAM-centric designs: each generated token requires streaming the model weights through memory, so bandwidth is the bottleneck.
- Training (and prefill) have high arithmetic intensity and favor GPUs; decode performs only a few FLOPs per byte moved, favoring SRAM-centric chips.
- Disaggregation splits serving into phases (prefill vs. decode) and maps each to suitable hardware: compute-bound prefill to GPUs, memory-bound decode to SRAM-centric chips.
- Future trends include specialized memory architectures (e.g., stacked DRAM, GDDR7) to address the memory wall in LLM inference.
- Gimlet's multi-silicon inference cloud leverages both GPUs and SRAM-centric chips, optimizing workload distribution for efficiency.
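The bandwidth claim above can be made concrete with a back-of-envelope bound: in decode, every new token requires reading the full set of weights, so per-token latency is at least model bytes divided by memory bandwidth. The bandwidth figures below are illustrative assumptions, not vendor specs.

```python
# Back-of-envelope decode latency bound: each new token requires reading
# every weight once, so per-token time >= model bytes / memory bandwidth.
# Bandwidth numbers are illustrative assumptions, not vendor specs.

def decode_token_ms(params_b: float, bytes_per_param: float, bw_tb_s: float) -> float:
    """Memory-bandwidth lower bound on per-token decode latency, in ms."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return model_bytes / (bw_tb_s * 1e12) * 1e3

# A 70B-parameter model at FP16 (2 bytes/param):
hbm_ms = decode_token_ms(70, 2, 3.35)   # ~3.35 TB/s HBM (assumed GPU figure)
sram_ms = decode_token_ms(70, 2, 80.0)  # ~80 TB/s aggregate on-chip SRAM (assumed)
print(f"HBM-bound:  {hbm_ms:.1f} ms/token")   # ≈ 41.8 ms/token
print(f"SRAM-bound: {sram_ms:.2f} ms/token")  # ≈ 1.75 ms/token
```

Even with generous assumptions, the bandwidth gap alone translates into an order-of-magnitude difference in the per-token floor.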
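The prefill-vs-decode split in the notes above comes down to arithmetic intensity (FLOPs per byte moved). A quick sketch for an FP16 GEMM, with dimensions chosen for illustration rather than any specific model:

```python
# Arithmetic intensity of C = A @ B at FP16, with A (m, k) and B (k, n):
# FLOPs ≈ 2mnk, bytes moved ≈ 2(mk + kn + mn). Dimensions are illustrative.

def gemm_intensity(m: int, k: int, n: int, bytes_per_el: int = 2) -> float:
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

prefill = gemm_intensity(4096, 8192, 8192)  # many tokens at once: matrix-matrix
decode = gemm_intensity(1, 8192, 8192)      # one token at a time: matrix-vector
print(f"prefill: {prefill:.0f} FLOPs/byte")  # ≈ 2048: compute-bound
print(f"decode:  {decode:.2f} FLOPs/byte")   # ≈ 1.00: memory-bound
```

At ~1 FLOP per byte, decode cannot keep a GPU's compute units busy; the chip is waiting on memory, which is exactly the regime SRAM-centric designs target.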
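The disaggregation idea can be sketched as a simple phase router for a mixed fleet. The pool names and `Request` shape below are hypothetical illustrations, not Gimlet's actual API.

```python
# Hypothetical phase router for a mixed GPU / SRAM-chip fleet. Pool names
# and the Request type are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

POOLS = {
    "prefill": "gpu-pool",   # compute-bound: big matmuls over the whole prompt
    "decode": "sram-pool",   # memory-bound: weight streaming per generated token
}

def plan(req: Request) -> list[tuple[str, str]]:
    """Map each phase of a request to the pool that matches its bottleneck."""
    phases = ["prefill"] if req.max_new_tokens == 0 else ["prefill", "decode"]
    return [(p, POOLS[p]) for p in phases]

print(plan(Request(prompt_tokens=2048, max_new_tokens=256)))
# [('prefill', 'gpu-pool'), ('decode', 'sram-pool')]
```

A real scheduler would also weigh queue depth, KV-cache transfer cost between pools, and model placement, but the core mapping is this two-way split.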