A Tiny Compiler for Data-Parallel Kernels
2 days ago
- #SIMD
- #parallelism
- #compiler
- A compiler transformation rewrites regular loops to allow multiple iterations to run in parallel using grouped execution.
- The compiler classifies values as uniform (same across lanes) or varying (different per lane) to guide lowering decisions.
- Uniform values are shared, while varying values require per-lane computation, affecting memory access patterns.
- Plain for loops are replaced with vector_for loops, where each lane represents an independent position in the grouped execution.
- Masks handle cases where the data length isn't divisible by the lane count, skipping out-of-bounds lanes.
- Loads with contiguous indices become masked_loads, while non-contiguous varying loads become gathers.
- Gathers allow parallel reads from different addresses, though they are often slower than contiguous accesses.
- The lowering step makes parallelism explicit, enabling later code generation to use faster instructions.
- The compiler takes hand-written ASTs as input and outputs a lowered IR, with the focus on dependency analysis.
- Code examples include scaling audio and color mapping kernels, demonstrating uniform vs. varying value handling.
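
The classification and lowering rules summarized above can be sketched in a few lines of Python. This is only an illustration, not the post's actual compiler: the node classes (`Const`, `Var`, `Add`, `Mul`, `Load`), the helper names, the `scalar_load` label for uniform loads, and the array names `samples`, `gain`, `palette` and `indices` are all assumptions made for the example. It also assumes that every variable other than the loop index is uniform.

```python
from dataclasses import dataclass

# Hypothetical mini-AST; node names are illustrative, not taken from the post.
@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    left: object
    right: object

@dataclass(frozen=True)
class Mul:
    left: object
    right: object

@dataclass(frozen=True)
class Load:
    array: str
    index: object

def is_varying(expr, loop_var):
    """True if expr depends on the loop index, so it may differ per lane.

    Assumption: every variable other than loop_var is uniform.
    """
    if isinstance(expr, Const):
        return False
    if isinstance(expr, Var):
        return expr.name == loop_var
    if isinstance(expr, Load):
        return is_varying(expr.index, loop_var)
    # Add / Mul: varying if either operand is varying.
    return is_varying(expr.left, loop_var) or is_varying(expr.right, loop_var)

def is_contiguous(index, loop_var):
    """True if index is the loop index plus a uniform offset (stride 1)."""
    if index == Var(loop_var):
        return True
    if isinstance(index, Add):
        left, right = index.left, index.right
        return (
            (is_contiguous(left, loop_var) and not is_varying(right, loop_var))
            or (is_contiguous(right, loop_var) and not is_varying(left, loop_var))
        )
    return False

def lower_load(load, loop_var):
    """Pick a lowered operation name for a load inside a vector_for loop."""
    if not is_varying(load.index, loop_var):
        return "scalar_load"   # hypothetical name: one read shared by all lanes
    if is_contiguous(load.index, loop_var):
        return "masked_load"   # adjacent addresses, one per lane
    return "gather"            # arbitrary per-lane addresses

def lane_mask(base, lanes, n):
    """Per-lane flags: a lane is active only if its position is in bounds."""
    return [base + lane < n for lane in range(lanes)]

# Audio scaling, out[i] = samples[i] * gain: contiguous varying load.
audio = lower_load(Load("samples", Var("i")), "i")                    # "masked_load"
# Color mapping, out[i] = palette[indices[i]]: data-dependent index.
color = lower_load(Load("palette", Load("indices", Var("i"))), "i")  # "gather"
# Last group of 4 lanes over 10 elements, starting at position 8.
tail = lane_mask(8, 4, 10)               # [True, True, False, False]
```

In the audio kernel the gain is uniform and the sample load is contiguous, so it lowers to a masked load. In the color-mapping kernel the inner `indices[i]` load is contiguous, but the outer `palette[...]` load is indexed by a varying, data-dependent value, so it becomes a gather.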