Hasty Briefs (beta)


Zero-Copy GPU Inference from WebAssembly on Apple Silicon

5 hours ago
  • #Apple Silicon
  • #GPU Acceleration
  • #WebAssembly
  • Apple Silicon's Unified Memory Architecture enables zero-copy data sharing between WebAssembly (Wasm) linear memory and the GPU, eliminating serialization and buffer overhead.
  • A three-link chain demonstrates this: mmap for page-aligned memory, Metal's MTLDevice.makeBuffer(bytesNoCopy:) for zero-copy GPU access, and Wasmtime's MemoryCreator trait for plugging custom allocation into the guest's linear memory.
  • Testing with a 128x128 matrix multiply shows correct results, no memory overhead (RSS delta ~0.03 MB vs. 16.78 MB for the copy path), and identical compute latency, confirming efficient data sharing.
  • Applied to AI inference, this allows running models like Llama 3.2 1B from a Wasm actor with negligible host function boundary costs, enabling fast prefill and per-token generation.
  • The portable KV cache can be serialized and restored, offering significant speedups over recomputing prefill (e.g., 5.45x faster at 24 tokens) and enabling stateful actor mobility across machines or model swaps.
  • Driftwood is being built as a runtime for stateful Wasm actors with GPU inference, focusing on actor snapshots, checkpoint portability, and multi-model support, though it's still in early development.