A Tiny Compiler for Data-Parallel Kernels

2 days ago

A compiler transformation rewrites regular loops so that several iterations run in parallel as one group of lanes (grouped execution).
The compiler classifies values as uniform (same across lanes) or varying (different per lane) to guide lowering decisions.
Uniform values are computed once and shared by all lanes, while varying values must be computed per lane; whether an index is uniform or varying determines the memory access pattern.
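The uniform/varying classification can be sketched as a small recursive pass over an expression tree. This is a minimal illustration, not the article's implementation: the node classes (Const, Param, LoopIndex, Load, BinOp) and the classify function are hypothetical names, and the rule that a load inherits the class of its index is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: float          # literal: the same in every lane


@dataclass(frozen=True)
class Param:
    name: str             # kernel argument: the same in every lane


@dataclass(frozen=True)
class LoopIndex:
    pass                  # the loop counter: different in every lane


@dataclass(frozen=True)
class Load:
    array: str
    index: object         # expression used as the index


@dataclass(frozen=True)
class BinOp:
    op: str
    left: object
    right: object


def classify(expr):
    """Return "uniform" or "varying" for a hypothetical expression node."""
    if isinstance(expr, (Const, Param)):
        return "uniform"
    if isinstance(expr, LoopIndex):
        return "varying"
    if isinstance(expr, Load):
        # Assumption: a load is varying exactly when its index is varying.
        return classify(expr.index)
    if isinstance(expr, BinOp):
        sides = (classify(expr.left), classify(expr.right))
        return "varying" if "varying" in sides else "uniform"
    raise TypeError(f"unknown node: {expr!r}")


# samples[i] * gain: the load depends on the loop index, gain does not.
scaled = BinOp("*", Load("samples", LoopIndex()), Param("gain"))
```

Here classify(scaled) is "varying" because one operand depends on the loop index, while classify(Param("gain")) is "uniform" and could be computed once for the whole group.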
For loops are replaced with vector_for loops, where lanes represent independent positions in grouped execution.
Masks handle cases where the element count isn't a multiple of the lane count: lanes that would fall out of bounds are switched off and skipped.
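A scalar simulation shows how a mask covers the ragged final group. This is only a sketch under assumptions: the vector_for generator below is a hypothetical Python stand-in for the lowered loop, the lane count of 4 is arbitrary, and real generated code would execute all lanes of a group at once rather than looping over them.

```python
def vector_for(n, lanes):
    """Yield (base, mask) for each group of `lanes` consecutive positions."""
    for base in range(0, n, lanes):
        # A lane is active only if its position is still inside the data.
        mask = [base + lane < n for lane in range(lanes)]
        yield base, mask


def scale(samples, gain, lanes=4):
    """Multiply every sample by a uniform gain, one group at a time."""
    out = [0.0] * len(samples)
    for base, mask in vector_for(len(samples), lanes):
        for lane in range(lanes):      # stands in for parallel lanes
            if mask[lane]:             # inactive lanes are skipped
                out[base + lane] = samples[base + lane] * gain
    return out
```

With 10 elements and 4 lanes, the groups start at positions 0, 4 and 8, and the last group runs with only its first two lanes active.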
Loads with contiguous indices become masked_loads, while non-contiguous varying loads become gathers.
Gathers allow parallel reads from different addresses, though often slower than contiguous accesses.
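The difference between the two load forms can be illustrated with a color-mapping lookup of the shape palette[pixels[i]]. The helper names masked_load and gather mirror the operations named above, but their signatures, the fill argument, and the sample data are assumptions made for this sketch.

```python
def masked_load(array, base, mask, fill=0):
    """Read consecutive elements starting at `base`; inactive lanes get `fill`."""
    return [array[base + lane] if active else fill
            for lane, active in enumerate(mask)]


def gather(array, indices, mask, fill=0):
    """Read one element per lane from arbitrary indices; inactive lanes get `fill`."""
    return [array[i] if active else fill
            for i, active in zip(indices, mask)]


pixels = [2, 0, 1, 2, 1]              # hypothetical palette indices
palette = ["red", "green", "blue"]    # hypothetical color table
lanes = 4

# First group: positions 0..3, all lanes active.
mask0 = [True, True, True, True]
idx0 = masked_load(pixels, 0, mask0)                 # contiguous: pixels[0..3]
colors0 = gather(palette, idx0, mask0, fill=None)    # scattered: palette[idx]

# Last group: position 4 only, lanes 1..3 are out of bounds.
mask1 = [True, False, False, False]
idx1 = masked_load(pixels, 4, mask1)
colors1 = gather(palette, idx1, mask1, fill=None)
```

The index pixels[base + lane] advances by one per lane, so it fits a masked_load, whereas the palette index is whatever each lane happened to read, so it needs a gather. Inactive lanes never touch memory in either helper.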
The lowering step makes parallelism explicit, enabling later code generation to use faster instructions.
The compiler takes hand-written ASTs as input and outputs a lowered IR; its focus is on dependency analysis.
Code examples include scaling audio and color mapping kernels, demonstrating uniform vs. varying value handling.
