Hasty Briefs (beta)

A Tiny Compiler for Data-Parallel Kernels

2 days ago
  • #SIMD
  • #parallelism
  • #compiler
  • A compiler transformation rewrites ordinary loops so that several iterations execute together as a group (grouped execution).
  • The compiler classifies values as uniform (same across lanes) or varying (different per lane) to guide lowering decisions.
  • Uniform values are computed once and shared by all lanes, while varying values are computed per lane; this distinction determines how memory accesses are lowered.
  • for loops are replaced with vector_for loops, in which each lane handles an independent position within the group.
  • Masks handle the case where the element count isn't a multiple of the lane count by disabling out-of-bounds lanes.
  • Loads with contiguous indices become masked_loads, while non-contiguous varying loads become gathers.
  • Gathers allow parallel reads from different addresses, though they are often slower than contiguous accesses.
  • The lowering step makes parallelism explicit, enabling later code generation to use faster instructions.
  • The compiler takes hand-written ASTs as input and outputs a lowered IR; its focus is dependency analysis.
  • Code examples include audio-scaling and color-mapping kernels, which demonstrate how uniform and varying values are handled.
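
The classification and lowering rules summarized above can be sketched in a few lines of Python. This is a minimal illustration and not the article's compiler: the AST node classes, the helper names (is_varying, is_contiguous, lower, lane_mask), the plain load used for uniform indices, and the textual IR format are all assumptions made for this example. Only the names masked_load and gather come from the summary.

```python
from dataclasses import dataclass

# Hypothetical mini-AST; the node names are illustrative assumptions.
@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:
    value: float

@dataclass(frozen=True)
class BinOp:
    op: str
    lhs: object
    rhs: object

@dataclass(frozen=True)
class Load:
    array: str
    index: object

def is_varying(expr, loop_var):
    """An expression is varying if it depends on the loop index, else uniform."""
    if isinstance(expr, Var):
        return expr.name == loop_var
    if isinstance(expr, Const):
        return False
    if isinstance(expr, BinOp):
        return is_varying(expr.lhs, loop_var) or is_varying(expr.rhs, loop_var)
    if isinstance(expr, Load):
        return is_varying(expr.index, loop_var)
    raise TypeError(f"unknown node: {expr!r}")

def is_contiguous(index, loop_var):
    """True for `i` or `i + uniform`: adjacent lanes read adjacent elements."""
    i = Var(loop_var)
    if index == i:
        return True
    if isinstance(index, BinOp) and index.op == "+":
        return (index.lhs == i and not is_varying(index.rhs, loop_var)) or (
            index.rhs == i and not is_varying(index.lhs, loop_var))
    return False

def lower(expr, loop_var):
    """Render an expression as lowered-IR text (the format is an assumption)."""
    if isinstance(expr, Var):
        return expr.name
    if isinstance(expr, Const):
        return repr(expr.value)
    if isinstance(expr, BinOp):
        return f"({lower(expr.lhs, loop_var)} {expr.op} {lower(expr.rhs, loop_var)})"
    if isinstance(expr, Load):
        idx = lower(expr.index, loop_var)
        if not is_varying(expr.index, loop_var):
            return f"load({expr.array}, {idx})"  # uniform: one shared read
        if is_contiguous(expr.index, loop_var):
            return f"masked_load({expr.array}, {idx}, mask)"
        return f"gather({expr.array}, {idx}, mask)"  # per-lane addresses
    raise TypeError(f"unknown node: {expr!r}")

def lane_mask(base, lanes, n):
    """Per-lane flags: True where base + lane is still inside the data."""
    return [base + lane < n for lane in range(lanes)]

# Audio scaling: out[i] = samples[i] * gain (gain is uniform).
scale = BinOp("*", Load("samples", Var("i")), Var("gain"))
# Color mapping: out[i] = palette[indices[i]] (index is varying, not contiguous).
colormap = Load("palette", Load("indices", Var("i")))

print(lower(scale, "i"))     # (masked_load(samples, i, mask) * gain)
print(lower(colormap, "i"))  # gather(palette, masked_load(indices, i, mask), mask)
print(lane_mask(8, 4, 10))   # [True, True, False, False]
```

In this sketch the audio kernel needs only a contiguous masked load because its index is the loop variable itself, while the color-mapping kernel indexes the palette with a value that differs per lane and therefore falls back to a gather. The last line shows the mask for the final group of 4 lanes over 10 elements, where the two out-of-bounds lanes are disabled.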