Hasty Briefs (beta)


Zero-Copy GPU Inference from WebAssembly on Apple Silicon

5 hours ago
  • #Apple Silicon
  • #GPU Acceleration
  • #WebAssembly
  • Apple Silicon's Unified Memory Architecture enables zero-copy data sharing between WebAssembly (Wasm) linear memory and the GPU, eliminating serialization and buffer overhead.
  • A three-link chain demonstrates this: mmap for page-aligned memory, Metal's MTLDevice.makeBuffer(bytesNoCopy:) for zero-copy GPU access, and Wasmtime's MemoryCreator trait for plugging custom allocation into the guest's linear memory.
  • Testing with a 128x128 matrix multiply shows correct results, no memory overhead (RSS delta ~0.03 MB vs. 16.78 MB for the copy path), and identical compute latency, confirming efficient data sharing.
  • Applied to AI inference, this allows running models like Llama 3.2 1B from a Wasm actor with negligible host function boundary costs, enabling fast prefill and per-token generation.
  • The portable KV cache can be serialized and restored, offering significant speedups over recomputing prefill (e.g., 5.45x faster at 24 tokens) and enabling stateful actor mobility across machines or model swaps.
  • Driftwood is being built as a runtime for stateful Wasm actors with GPU inference, focusing on actor snapshots, checkpoint portability, and multi-model support, though it's still in early development.