Hasty Briefs (beta)

CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

15 hours ago
  • #machine learning systems
  • #Transformer optimization
  • #GPU kernel design
  • Introduces CODA, a GPU kernel abstraction for rewriting Transformer blocks as GEMM-epilogue programs.
  • Addresses the memory-bound bottleneck caused by operators such as normalization and activations by performing those computations on-chip, before results are written to memory.
  • Uses a fixed GEMM mainloop with composable epilogue primitives for scaling, reductions, and accumulation.
  • Covers nearly all non-attention computation in the Transformer forward and backward passes, pairing programmer productivity with hardware efficiency.
  • Delivers high performance across workloads with both human-written and LLM-authored kernels, which suggests the abstraction is practical.
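
The idea summarized above, a fixed GEMM mainloop followed by a chain of composable epilogue steps, can be illustrated with a small CPU sketch. This is not CODA's API: the names `gemm_with_epilogue`, `add_bias`, `gelu_tanh`, and `add_residual` are hypothetical, and a NumPy tile held in a local array merely stands in for an accumulator kept in on-chip memory. The point it shows is that each output tile is written exactly once, after every epilogue step has been applied to it.

```python
import math
import numpy as np

# Hypothetical epilogue primitives (illustrative names, not CODA's API).
# Each returns a function that transforms one accumulator tile, given the
# row/column slices that locate the tile in the full output matrix.
def add_bias(bias):
    """Add the matching slice of a per-column bias vector."""
    return lambda acc, rows, cols: acc + bias[cols]

def gelu_tanh():
    """Apply the tanh approximation of GELU elementwise."""
    c = math.sqrt(2.0 / math.pi)
    return lambda acc, rows, cols: 0.5 * acc * (1.0 + np.tanh(c * (acc + 0.044715 * acc**3)))

def add_residual(residual):
    """Add the matching tile of a residual matrix."""
    return lambda acc, rows, cols: acc + residual[rows, cols]

def gemm_with_epilogue(a, b, epilogue=(), tile=32):
    """Tiled matmul whose mainloop is fixed; only the epilogue list varies."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.empty((m, n), dtype=np.float32)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            rows = slice(i, min(i + tile, m))
            cols = slice(j, min(j + tile, n))
            # Mainloop: accumulate partial products for this output tile.
            acc = np.zeros((rows.stop - rows.start, cols.stop - cols.start), dtype=np.float32)
            for p in range(0, k, tile):
                acc += a[rows, p:p + tile] @ b[p:p + tile, cols]
            # Epilogue: transform the tile while it is still "on chip".
            for step in epilogue:
                acc = step(acc, rows, cols)
            # Single write of the finished tile to the output buffer.
            out[rows, cols] = acc
    return out
```

For example, `gemm_with_epilogue(x, w, [add_bias(b), gelu_tanh(), add_residual(r)])` computes `gelu(x @ w + b) + r` without materializing the intermediate `x @ w` or `x @ w + b` as full matrices. Row-wise reductions such as those needed for normalization would require extra cross-tile bookkeeping that this sketch leaves out.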