The emerging role of SRAM-centric chips in AI inference
- #hardware-optimization
- #SRAM-architecture
- #AI-inference
- SRAM-centric chips (e.g., Cerebras, Groq, d-Matrix) are gaining traction in AI inference due to their latency and throughput advantages over GPUs.
- Key tradeoff between SRAM and HBM (a form of stacked DRAM): SRAM is fast and sits on-chip but is far less dense; HBM is much denser but slower and off-chip.
- SRAM uses 6 transistors per bit, giving fast reads (~1 ns); DRAM uses 1 transistor and 1 capacitor per bit, which makes it denser but slower (~10-15 ns reads) and requires periodic refresh.
- GPU memory hierarchy includes SRAM (registers, shared memory, caches) and DRAM (HBM), balancing speed and density.
- SRAM-centric architectures devote die area to on-chip memory rather than compute, trading raw FLOPs for much higher memory bandwidth, which suits memory-bound workloads.
- Inference, especially the decode phase, benefits from SRAM-centric designs: each generated token requires streaming the model weights through memory, so bandwidth is the bottleneck.
- Training (and prefill) have high arithmetic intensity and favor GPUs; decode performs only a few FLOPs per byte moved, favoring SRAM-centric chips.
- Disaggregation splits serving into phases (prefill vs. decode) and maps each to suitable hardware: compute-bound prefill to GPUs, memory-bound decode to SRAM-centric chips.
- Future trends include specialized memory architectures (e.g., stacked DRAM, GDDR7) to address the memory wall in LLM inference.
- Gimlet's multi-silicon inference cloud leverages both GPUs and SRAM-centric chips, optimizing workload distribution for efficiency.
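The bandwidth claim above can be made concrete with a back-of-envelope bound: in decode, every new token requires reading the full set of weights, so per-token latency is at least model bytes divided by memory bandwidth. The bandwidth figures below are illustrative assumptions, not vendor specs.

```python
# Back-of-envelope decode latency bound: each new token requires reading
# every weight once, so per-token time >= model bytes / memory bandwidth.
# Bandwidth numbers are illustrative assumptions, not vendor specs.

def decode_token_ms(params_b: float, bytes_per_param: float, bw_tb_s: float) -> float:
    """Memory-bandwidth lower bound on per-token decode latency, in ms."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return model_bytes / (bw_tb_s * 1e12) * 1e3

# A 70B-parameter model at FP16 (2 bytes/param):
hbm_ms = decode_token_ms(70, 2, 3.35)   # ~3.35 TB/s HBM (assumed GPU figure)
sram_ms = decode_token_ms(70, 2, 80.0)  # ~80 TB/s aggregate on-chip SRAM (assumed)
print(f"HBM-bound:  {hbm_ms:.1f} ms/token")   # ≈ 41.8 ms/token
print(f"SRAM-bound: {sram_ms:.2f} ms/token")  # ≈ 1.75 ms/token
```

Even with generous assumptions, the bandwidth gap alone translates into an order-of-magnitude difference in the per-token floor.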
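The prefill-vs-decode split in the notes above comes down to arithmetic intensity (FLOPs per byte moved). A quick sketch for an FP16 GEMM, with dimensions chosen for illustration rather than any specific model:

```python
# Arithmetic intensity of C = A @ B at FP16, with A (m, k) and B (k, n):
# FLOPs ≈ 2mnk, bytes moved ≈ 2(mk + kn + mn). Dimensions are illustrative.

def gemm_intensity(m: int, k: int, n: int, bytes_per_el: int = 2) -> float:
    flops = 2 * m * n * k
    bytes_moved = bytes_per_el * (m * k + k * n + m * n)
    return flops / bytes_moved

prefill = gemm_intensity(4096, 8192, 8192)  # many tokens at once: matrix-matrix
decode = gemm_intensity(1, 8192, 8192)      # one token at a time: matrix-vector
print(f"prefill: {prefill:.0f} FLOPs/byte")  # ≈ 2048: compute-bound
print(f"decode:  {decode:.2f} FLOPs/byte")   # ≈ 1.00: memory-bound
```

At ~1 FLOP per byte, decode cannot keep a GPU's compute units busy; the chip is waiting on memory, which is exactly the regime SRAM-centric designs target.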
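The disaggregation idea can be sketched as a simple phase router for a mixed fleet. The pool names and `Request` shape below are hypothetical illustrations, not Gimlet's actual API.

```python
# Hypothetical phase router for a mixed GPU / SRAM-chip fleet. Pool names
# and the Request type are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    max_new_tokens: int

POOLS = {
    "prefill": "gpu-pool",   # compute-bound: big matmuls over the whole prompt
    "decode": "sram-pool",   # memory-bound: weight streaming per generated token
}

def plan(req: Request) -> list[tuple[str, str]]:
    """Map each phase of a request to the pool that matches its bottleneck."""
    phases = ["prefill"] if req.max_new_tokens == 0 else ["prefill", "decode"]
    return [(p, POOLS[p]) for p in phases]

print(plan(Request(prompt_tokens=2048, max_new_tokens=256)))
# [('prefill', 'gpu-pool'), ('decode', 'sram-pool')]
```

A real scheduler would also weigh queue depth, KV-cache transfer cost between pools, and model placement, but the core mapping is this two-way split.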