Reimagining Kernel Generation at the PTX Layer

2 days ago
  • #LLMs
  • #PTX
  • #GPU Optimization
  • The work presents a hybrid system that combines program analysis with LLMs to transform and optimize PTX, the layer shared across GPU DSLs.
  • PTX is NVIDIA's low-level GPU instruction layer; it has traditionally been hard to work with, and getting good performance from it has required manual optimization.
  • Recent LLM advances make PTX's complexity manageable, but LLMs alone are insufficient; pairing them with program analysis improves the system's understanding of the code.
  • The system uses program analysis and LLMs to condense PTX into a tractable representation that can be compared and optimized.
  • It learns best practices from multiple DSLs (e.g., Triton, TileLang, ThunderKittens, CUTLASS) and combines them at the PTX level.
  • The generated kernels outperform those of the individual DSLs: for example, RMSNorm-1024 is ~67% faster and Matmul-1024 is ~5% faster than the baselines.
  • The optimizations, drawn from cross-DSL insights, cover instruction selection, tiling, memory access patterns, and hardware-specific features.
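
The brief does not describe the condensed representation, so the following is only a minimal sketch of the general idea, assuming the simplest possible summary: an opcode histogram per kernel. The two PTX fragments, the `condense` and `diff` helpers, and the histogram representation itself are all hypothetical illustrations, not the system's actual method.

```python
import re
from collections import Counter

# Hypothetical, hand-written PTX fragments (not taken from any real DSL output).
PTX_A = """
.visible .entry kernel_a(.param .u64 p0)
{
    .reg .f32 %f<4>;
    ld.param.u64 %rd1, [p0];
    ld.global.f32 %f1, [%rd1];
    ld.global.f32 %f2, [%rd1+4];
    fma.rn.f32 %f3, %f1, %f2, %f1;
    st.global.f32 [%rd1], %f3;
    ret;
}
"""

PTX_B = """
.visible .entry kernel_b(.param .u64 p0)
{
    .reg .f32 %f<4>;
    ld.param.u64 %rd1, [p0];
    ld.global.v2.f32 {%f1, %f2}, [%rd1];
    fma.rn.f32 %f3, %f1, %f2, %f1;
    st.global.f32 [%rd1], %f3;
    ret;
}
"""


def condense(ptx: str) -> Counter:
    """Assumed 'tractable representation': instruction counts keyed by opcode."""
    counts = Counter()
    for raw in ptx.splitlines():
        line = raw.split("//", 1)[0].strip()  # drop trailing comments
        # Skip blanks, directives (.reg, .entry, ...), braces, and labels.
        if not line or line.startswith(".") or line in ("{", "}") or line.endswith(":"):
            continue
        # Drop an optional predicate guard such as "@%p1" or "@!%p1".
        line = re.sub(r"^@!?%\w+\s+", "", line)
        opcode = line.split()[0].rstrip(";")
        counts[opcode] += 1
    return counts


def diff(a: Counter, b: Counter) -> dict:
    """Return opcodes whose counts differ, as {opcode: (count_in_a, count_in_b)}."""
    return {op: (a[op], b[op]) for op in sorted(set(a) | set(b)) if a[op] != b[op]}


delta = diff(condense(PTX_A), condense(PTX_B))
print(delta)
```

On these two fragments the only difference reported is the load strategy: two scalar `ld.global.f32` loads in one kernel versus a single vectorized `ld.global.v2.f32` in the other. A real system would need a far richer summary (control flow, memory spaces, tiling structure), but even this crude view shows how condensing PTX makes kernels from different sources directly comparable.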