The emerging role of SRAM-centric chips in AI inference

  • #hardware-optimization
  • #SRAM-architecture
  • #AI-inference
  • SRAM-centric chips (e.g., Cerebras, Groq, d-Matrix) are gaining traction in AI inference due to their latency and throughput advantages over GPUs.
  • Key tradeoff between SRAM and HBM (DRAM): SRAM is faster and on-chip, while HBM is denser but slower and off-chip.
  • SRAM uses 6 transistors per bit, offering fast reads (~1 ns); DRAM uses 1 transistor and 1 capacitor per bit, making it denser but slower to read (~10-15 ns).
  • GPU memory hierarchy includes SRAM (registers, shared memory, caches) and DRAM (HBM), balancing speed and density.
  • SRAM-centric architectures prioritize on-chip memory, sacrificing compute capacity for higher memory bandwidth, ideal for memory-bound workloads.
  • Inference workloads, especially decode phases, benefit from SRAM-centric designs due to high memory bandwidth needs.
  • Training workloads favor GPUs due to high arithmetic intensity, while inference (especially decode) is memory-bound, favoring SRAM-centric chips.
  • Disaggregation techniques (e.g., splitting prefill from decode) optimize performance by mapping each phase to suitable hardware: GPUs for compute-bound prefill, SRAM-centric chips for memory-bound decode.
  • Future trends include specialized memory architectures (e.g., stacked DRAM, GDDR7) to address the memory wall in LLM inference.
  • Gimlet's multi-silicon inference cloud leverages both GPUs and SRAM-centric chips, optimizing workload distribution for efficiency.
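The compute-bound vs. memory-bound split above can be checked with a back-of-envelope roofline calculation. The sketch below is illustrative only: the parameter count, FLOP/s, and bandwidth figures are assumptions chosen to be GPU-like, not measurements from any specific chip.

```python
# Roofline back-of-envelope: why decode is memory-bound and prefill is not.
# All hardware numbers below are illustrative assumptions.

def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved

# Assume a 70B-parameter model stored in FP16 (2 bytes per weight).
params = 70e9
weight_bytes = params * 2

# Decode (batch size 1): each new token streams every weight once and does
# ~2 FLOPs per weight (one multiply, one add).
ai_decode = arithmetic_intensity(2 * params, weight_bytes)

# Prefill: a long prompt reuses each weight across all its tokens, so FLOPs
# scale with sequence length while weight traffic stays roughly constant.
seq_len = 2048
ai_prefill = arithmetic_intensity(2 * params * seq_len, weight_bytes)

# Assumed GPU-like machine balance: ~2e15 FLOP/s peak over ~2e12 B/s of HBM
# bandwidth. Workloads below this ratio are memory-bound, above it compute-bound.
machine_balance = 2e15 / 2e12

print(f"decode:  {ai_decode:>6.0f} FLOP/byte")   # far below machine balance
print(f"prefill: {ai_prefill:>6.0f} FLOP/byte")  # above machine balance
print(f"balance: {machine_balance:>6.0f} FLOP/byte")
```

With these assumed numbers, decode lands around 1 FLOP/byte against a machine balance near 1000, which is why decode throughput tracks memory bandwidth rather than peak FLOP/s.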
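A minimal sketch of the phase-disaggregation idea, assuming a router that sends each request phase to a different hardware pool. The pool names, `Request` fields, and routing rule are hypothetical illustrations, not Gimlet's or any vendor's actual API.

```python
# Hypothetical phase-disaggregated routing: compute-bound prefill goes to a
# GPU pool, memory-bound decode goes to an SRAM-centric accelerator pool.
# Names and structure are assumptions for illustration.

from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int
    phase: str  # "prefill" or "decode"

def route(req: Request) -> str:
    """Map a request phase to the hardware pool suited to its bottleneck."""
    if req.phase == "prefill":
        return "gpu-pool"    # high arithmetic intensity: compute-bound
    return "sram-pool"       # per-token weight streaming: memory-bound

print(route(Request(prompt_tokens=2048, phase="prefill")))  # gpu-pool
print(route(Request(prompt_tokens=1, phase="decode")))      # sram-pool
```

In a real system the router would also weigh batch size, KV-cache placement, and pool load, but the core mapping of phase to memory-vs-compute profile is the point of the disaggregation bullet above.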