CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

15 hours ago

Introduces CODA, a GPU kernel abstraction for rewriting Transformer blocks as GEMM-epilogue programs.
Addresses the memory-bound bottleneck caused by operators such as normalization and activation functions by performing their computation on chip, before results are written to memory.
Uses a fixed GEMM mainloop with composable epilogue primitives for scaling, reductions, and accumulation.
Covers nearly all non-attention computation in Transformer forward/backward passes, combining productivity and hardware efficiency.
Achieves high performance across workloads with both human-written and LLM-authored kernels, which suggests the abstraction is practical to use.
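
To make the "fixed mainloop plus composable epilogue" idea concrete, here is a minimal pure-Python sketch. It is not CODA's API: every name (`gemm_with_epilogue`, `scale`, `relu`, `row_sum_into`, `compose`) is hypothetical, and nested lists stand in for GPU tiles. It only illustrates the general pattern of transforming each accumulator tile before its single write to the output.

```python
# Hypothetical sketch; none of these names come from CODA itself.

def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Compute epilogue(A @ B) one output tile at a time."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            rows = range(i0, min(i0 + tile, m))
            cols = range(j0, min(j0 + tile, n))
            # Fixed mainloop: accumulate the tile in local storage,
            # standing in for on-chip registers or shared memory.
            acc = [[sum(a[i][p] * b[p][j] for p in range(k)) for j in cols]
                   for i in rows]
            # Epilogue: transform the accumulator before it is written out.
            acc = epilogue(acc, rows, cols)
            for r, i in enumerate(rows):
                for c, j in enumerate(cols):
                    out[i][j] = acc[r][c]
    return out

def scale(alpha):
    """Scaling primitive: multiply every element of the tile by alpha."""
    return lambda acc, rows, cols: [[alpha * x for x in row] for row in acc]

def relu(acc, rows, cols):
    """Activation primitive applied elementwise to the tile."""
    return [[max(x, 0.0) for x in row] for row in acc]

def row_sum_into(sums):
    """Reduction primitive: accumulate per-row sums across tiles into `sums`."""
    def step(acc, rows, cols):
        for r, i in enumerate(rows):
            sums[i] += sum(acc[r])
        return acc
    return step

def compose(*steps):
    """Chain epilogue primitives in order."""
    def run(acc, rows, cols):
        for step in steps:
            acc = step(acc, rows, cols)
        return acc
    return run

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, -1.0], [0.0, 0.0]]
row_sums = [0.0, 0.0]
out = gemm_with_epilogue(
    a, b, compose(scale(2.0), relu, row_sum_into(row_sums)), tile=1
)
```

In this sketch the scaled, activated result and the per-row sums are both produced without materializing the raw product `A @ B` as a separate intermediate, which is the memory-traffic saving the summary describes. A real GPU kernel would express the same structure with hardware tiles and threads rather than Python lists.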

Hasty Briefs (beta)