CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
15 hours ago
- #machine learning systems
- #Transformer optimization
- #GPU kernel design
- Introduces CODA, a GPU kernel abstraction for rewriting Transformer blocks as GEMM-epilogue programs.
- Addresses the memory-bound bottleneck caused by operators such as normalization and activations by performing those computations on chip, on the GEMM output, before the result is written to memory.
- Uses a fixed GEMM mainloop with composable epilogue primitives for scaling, reductions, and accumulation.
- Covers nearly all non-attention computation in the Transformer forward and backward passes, combining programmer productivity with hardware efficiency.
- Achieves high performance across workloads with both human-authored and LLM-authored kernels, which suggests the abstraction is practical to use.
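
The idea in the bullets above, a fixed GEMM mainloop followed by composable epilogue steps that run before the store, can be sketched in plain Python. This is only an illustration of the general GEMM-epilogue pattern, not CODA's actual API: the names `gemm_with_epilogue`, `compose`, `scale`, `add_bias`, `relu`, and `row_sumsq` are hypothetical, and the nested lists stand in for GPU tiles and registers.

```python
def gemm_with_epilogue(A, B, epilogue, tile=2):
    """Compute epilogue(A @ B) tile by tile, storing each output once."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            rows = range(i0, min(i0 + tile, M))
            cols = range(j0, min(j0 + tile, N))
            # Fixed mainloop: accumulate one output tile (stand-in for registers).
            acc = {(i, j): 0.0 for i in rows for j in cols}
            for k in range(K):
                for i in rows:
                    for j in cols:
                        acc[(i, j)] += A[i][k] * B[k][j]
            # Epilogue: transform the accumulator, then write it out once.
            for (i, j), v in acc.items():
                C[i][j] = epilogue(v, i, j)
    return C


# Hypothetical epilogue primitives; each takes (value, row, col) and returns a value.
def scale(s):
    return lambda v, i, j: v * s


def add_bias(bias):
    return lambda v, i, j: v + bias[j]


def relu():
    return lambda v, i, j: max(v, 0.0)


def row_sumsq(buf):
    """Reduction: accumulate per-row sum of squares into a side buffer."""
    def step(v, i, j):
        buf[i] += v * v
        return v
    return step


def compose(*steps):
    """Chain primitives left to right into a single epilogue."""
    def run(v, i, j):
        for step in steps:
            v = step(v, i, j)
        return v
    return run


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0, 0.0], [0.0, 1.0]]  # identity, so A @ B == A
sumsq = [0.0, 0.0]
epilogue = compose(scale(2.0), add_bias([1.0, -10.0]), relu(), row_sumsq(sumsq))
C = gemm_with_epilogue(A, B, epilogue, tile=1)
print(C)      # [[3.0, 0.0], [7.0, 0.0]]
print(sumsq)  # [9.0, 49.0]
```

In this sketch the scaling, bias, activation, and row reduction all run on the accumulator, so the unfused alternative of writing `A @ B` to memory and re-reading it for each elementwise operator is avoided. A real GPU kernel would keep `acc` in registers or shared memory and need a cross-tile strategy for the reduction; the side buffer here only stands in for that.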