Reimagining Kernel Generation at the PTX Layer

2 days ago

A hybrid system combining program analysis and LLMs transforms and optimizes PTX, the layer shared across GPU DSLs.
PTX is NVIDIA's low-level GPU instruction layer; it has traditionally been hard to work with, and getting good performance from it has required manual optimization.
Recent LLM advances make PTX's complexity manageable, but LLMs alone are insufficient; pairing them with program analysis improves understanding of the code.
The system uses program analysis and LLMs to condense PTX into a tractable representation that can be compared and optimized.
It learns and combines best practices from multiple DSLs (e.g., Triton, TileLang, ThunderKittens, CUTLASS) at the PTX level.
Generated kernels outperform those from individual DSLs: for example, RMSNorm-1024 is ~67% faster and Matmul-1024 is ~5% faster than the baselines.
Optimizations drawn from cross-DSL insights include instruction selection, tiling, memory access patterns, and hardware-specific features.
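
As a rough illustration of what condensing PTX into a comparable form could look like, the Python sketch below reduces a PTX fragment to a per-opcode histogram and flags a few hardware-specific instructions. It is not the system summarized here: the sample PTX (which is illustrative and not meant to assemble), the function names `summarize_ptx` and `hardware_features`, and the set of flagged opcodes are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical PTX fragment, written for illustration only.
SAMPLE_PTX = """
.visible .entry rmsnorm_kernel(.param .u64 p0)
{
    .reg .f32 %f<4>;
    .reg .b64 %rd<2>;
    ld.param.u64 %rd1, [p0];
    ld.global.f32 %f1, [%rd1];
    mul.f32 %f2, %f1, %f1;
    shfl.sync.bfly.b32 %f3, %f2, 16, 31, -1;
    add.f32 %f2, %f2, %f3;
    st.global.f32 [%rd1], %f2;
    ret;
}
"""

# Optional predicate guard (e.g. "@%p1" or "@!%p1"), then the dotted opcode.
OPCODE_RE = re.compile(r"(?:@!?%\w+\s+)?([a-z][\w.]*)")

# PTX state spaces that are worth keeping in the summary key.
STATE_SPACES = {"global", "shared", "param", "local", "const"}

# Assumed examples of hardware-specific opcodes; this list is not exhaustive.
HW_OPCODES = {"shfl", "mma", "wgmma", "ldmatrix", "cp"}


def summarize_ptx(ptx: str) -> Counter:
    """Count instructions by base opcode, plus state space when it is
    the second dotted token (for example 'ld.global')."""
    counts = Counter()
    for raw in ptx.splitlines():
        line = raw.split("//")[0].strip()  # drop trailing comments
        # Skip blanks, braces, directives (leading '.'), and labels.
        if not line or line in ("{", "}") or line.startswith(".") or line.endswith(":"):
            continue
        match = OPCODE_RE.match(line)
        if match is None:
            continue
        parts = match.group(1).split(".")
        key = parts[0]
        if len(parts) > 1 and parts[1] in STATE_SPACES:
            key += "." + parts[1]
        counts[key] += 1
    return counts


def hardware_features(counts: Counter) -> list:
    """Return the summarized opcodes that fall in the assumed HW_OPCODES set."""
    return sorted(k for k in counts if k.split(".")[0] in HW_OPCODES)


if __name__ == "__main__":
    summary = summarize_ptx(SAMPLE_PTX)
    for opcode, n in sorted(summary.items()):
        print(f"{opcode:10s} {n}")
    print("hardware-specific:", hardware_features(summary))
```

Histograms like this from two kernels can be compared directly with `Counter` subtraction, which is one simple way to see where the PTX emitted by different DSLs diverges; a real system would need a much richer representation than opcode counts.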
