Reimagining Kernel Generation at the PTX Layer
2 days ago
- #LLMs
- #PTX
- #GPU Optimization
- Built a hybrid system that combines program analysis with LLMs to transform and optimize PTX, the layer shared across DSLs.
- PTX is a low-level GPU execution layer that has traditionally been hard to work with; getting performance out of it has required manual optimization.
- Recent LLM advances make PTX's complexity manageable, but LLMs alone are insufficient; the hybrid approach improves understanding of the PTX.
- The system uses program analysis and LLMs to condense PTX into a tractable representation that can be compared and optimized.
- It learns and combines best practices from multiple DSLs (e.g., Triton, TileLang, ThunderKittens, CUTLASS) at the PTX level.
- Generated kernels outperform those from individual DSLs: for example, RMSNorm-1024 is ~67% faster and Matmul-1024 is ~5% faster than the baselines.
- Optimizations drawn from cross-DSL insights include instruction selection, tiling, memory access patterns, and hardware-specific features.
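
To make the idea of condensing PTX for comparison more concrete, here is a minimal sketch of what the program-analysis half might look like. It is not the system described above: it only counts opcodes in a PTX listing and reports where two kernels differ, and it leaves out the LLM half entirely. The function names `condense` and `diff`, and the two toy kernels, are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# An instruction line: optional predicate guard (@%p1 or @!%p1), then the
# opcode with its dot-separated modifiers (e.g. ld.global.v4.f32).
_INSTR = re.compile(r"^\s*(?:@!?%?\w+\s+)?([a-z][\w.]*)")


def condense(ptx: str) -> Counter:
    """Condense a PTX listing into a histogram of full opcodes (hypothetical helper)."""
    counts = Counter()
    for raw in ptx.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        # Skip blanks, directives (.reg, .visible, ...), braces, and labels.
        if not line or line.startswith((".", "{", "}")) or line.endswith(":"):
            continue
        match = _INSTR.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def diff(a: Counter, b: Counter) -> dict:
    """Opcodes whose counts differ between two kernels, as (count_a, count_b)."""
    return {op: (a[op], b[op]) for op in sorted(a.keys() | b.keys()) if a[op] != b[op]}


# Toy listings (illustrative, not output from any real DSL compiler).
KERNEL_A = """
.visible .entry scale(.param .u64 p0)
{
    .reg .f32 %f<3>;
    ld.global.f32 %f1, [%rd1];
    mul.f32 %f2, %f1, %f1;
    st.global.f32 [%rd1], %f2;
    ret;
}
"""

KERNEL_B = """
.visible .entry scale(.param .u64 p0)
{
    .reg .f32 %f<9>;
    ld.global.v4.f32 {%f1, %f2, %f3, %f4}, [%rd1];
    @%p1 bra $L_done;
    mul.f32 %f5, %f1, %f1;
$L_done:
    st.global.v4.f32 [%rd1], {%f5, %f6, %f7, %f8};
    ret;
}
"""

if __name__ == "__main__":
    for op, (in_a, in_b) in diff(condense(KERNEL_A), condense(KERNEL_B)).items():
        print(f"{op:20s} A={in_a} B={in_b}")
```

Even this crude histogram surfaces a difference in memory access pattern (scalar versus `v4` vectorized loads and stores) between the two toy kernels, which is the kind of signal a cross-DSL comparison would presumably need. A real representation would have to capture far more, such as control flow, tiling structure, and data dependencies.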