Zero-Copy GPU Inference from WebAssembly on Apple Silicon
- #Apple Silicon
- #GPU Acceleration
- #WebAssembly
- Apple Silicon's Unified Memory Architecture enables zero-copy data sharing between WebAssembly (Wasm) linear memory and the GPU, eliminating serialization and buffer-copy overhead.
- A three-link chain makes this concrete: mmap provides page-aligned memory, Metal's MTLDevice.makeBuffer(bytesNoCopy:) wraps those pages for zero-copy GPU access, and Wasmtime's MemoryCreator trait plugs the custom allocation in as Wasm linear memory.
- Testing with a 128x128 matrix multiply shows correct results, no measurable memory overhead (RSS delta of ~0.03 MB for the zero-copy path vs. 16.78 MB for the copy path), and identical compute latency, confirming the data is genuinely shared rather than copied.
- Applied to AI inference, this allows running models like Llama 3.2 1B from a Wasm actor with negligible host function boundary costs, enabling fast prefill and per-token generation.
- The portable KV cache can be serialized and restored, offering significant speedups (e.g., 5.45x faster at 24 tokens) and enabling stateful actor mobility across machines or model swaps.
- Driftwood is being built as a runtime for stateful Wasm actors with GPU inference, focusing on actor snapshots, checkpoint portability, and multi-model support, though it's still in early development.