WebGPU feature detection was not enough to run small LLMs on phones
- #Browser Performance
- #LLM Inference
- #WebGPU
- WebGPU feature detection alone is not enough to run small LLMs on phones: an adapter that reports adequate limits can still fail to complete inference.
- Testing in four environments surfaced failures: Safari on iPhone reloaded the page mid-generation, and LINE's in-app browser stalled without finishing a run.
- Performance varied widely between runtimes: on a Windows desktop, WebLLM decoded tokens twice as fast as wllama, even though both ran with the same WebGPU support.
- On a Pixel 8a in Chrome, a long prompt (1,213 tokens) took more than 76 seconds to produce the first token, versus about 4 seconds for a short prompt, showing how strongly prompt length drives time to first token on a phone.
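For reference, the kind of check the post calls insufficient looks roughly like the minimal TypeScript sketch below. It uses the standard WebGPU calls `navigator.gpu.requestAdapter()`, `adapter.limits`, and `adapter.features`; the `REQUIRED_LIMITS` thresholds and the `checkLimits` and `probeWebGPU` helpers are hypothetical placeholders, not values or code from the post.

```typescript
// Hypothetical placeholder thresholds. Real minimums depend on the model's
// weights and on the runtime (WebLLM, wllama); these are NOT from the post.
const REQUIRED_LIMITS: Record<string, number> = {
  maxBufferSize: 1024 * 1024 * 1024,              // 1 GiB (placeholder)
  maxStorageBufferBindingSize: 512 * 1024 * 1024, // 512 MiB (placeholder)
};

interface ProbeResult {
  ok: boolean;
  reason?: string;
  missing?: string[];     // limits that are absent or below the threshold
  hasShaderF16?: boolean; // whether the adapter exposes "shader-f16"
}

// Compare reported limits against required minimums and return the names
// of limits that are absent or too small.
function checkLimits(
  limits: Record<string, number | undefined>,
  required: Record<string, number>,
): string[] {
  return Object.entries(required)
    .filter(([name, min]) => (limits[name] ?? 0) < min)
    .map(([name]) => name);
}

// Feature detection plus a limits check. Passing it says nothing about
// whether a full generation will actually complete on the device.
async function probeWebGPU(
  required: Record<string, number> = REQUIRED_LIMITS,
): Promise<ProbeResult> {
  // `any` avoids depending on @webgpu/types in this self-contained sketch.
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) return { ok: false, reason: "navigator.gpu is unavailable" };
  const adapter = await gpu.requestAdapter();
  if (!adapter) return { ok: false, reason: "requestAdapter() returned null" };
  // GPUSupportedLimits exposes limits as named properties, so indexing works.
  const missing = checkLimits(adapter.limits, required);
  return {
    ok: missing.length === 0,
    missing,
    hasShaderF16: adapter.features.has("shader-f16"),
  };
}
```

Per the post, a phone can pass a check like this and still fail during generation, so the result can rule a device out but cannot rule it in.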
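Time to first token and stalled runs, the symptoms in the Pixel 8a and LINE results, can be measured from any streamed generation. This sketch assumes the runtime's output is available as an `AsyncIterable<string>` of tokens (obtaining one from WebLLM or wllama is not shown); `timeRun`, `nextOrStall`, `prefillTokensPerSec`, and the timeout values are hypothetical. A run counts as stalled when no token arrives within its time budget, instead of waiting forever.

```typescript
interface RunTiming {
  ttftMs: number | null; // null when no token arrived at all
  tokens: number;        // tokens received before the run ended or stalled
  stalled: boolean;      // true when a timeout fired before the next token
}

interface Timeouts {
  firstTokenMs: number; // generous: long-prompt prefill can take over a minute
  interTokenMs: number; // allowed gap between later tokens
}

// Resolve with the iterator's next result, or "stall" if `ms` elapses first.
async function nextOrStall<T>(
  it: AsyncIterator<T>,
  ms: number,
): Promise<IteratorResult<T> | "stall"> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"stall">((resolve) => {
    timer = setTimeout(() => resolve("stall"), ms);
  });
  try {
    return await Promise.race([it.next(), timeout]);
  } finally {
    clearTimeout(timer); // don't leave the timer running after a token arrives
  }
}

// Hypothetical helper: consume a token stream, record time to first token,
// and flag a stall. Cancelling the underlying engine is left to the caller.
async function timeRun(
  stream: AsyncIterable<string>,
  timeouts: Timeouts,
  now: () => number = () => performance.now(),
): Promise<RunTiming> {
  const it = stream[Symbol.asyncIterator]();
  const start = now();
  let ttftMs: number | null = null;
  let tokens = 0;
  for (;;) {
    const budget = tokens === 0 ? timeouts.firstTokenMs : timeouts.interTokenMs;
    const next = await nextOrStall(it, budget);
    if (next === "stall") return { ttftMs, tokens, stalled: true };
    if (next.done) return { ttftMs, tokens, stalled: false };
    if (ttftMs === null) ttftMs = now() - start;
    tokens++;
  }
}

// Prompt tokens processed per second, assuming time to first token is mostly prefill.
function prefillTokensPerSec(promptTokens: number, ttftMs: number): number {
  return promptTokens / (ttftMs / 1000);
}
```

With the post's Pixel 8a numbers, `prefillTokensPerSec(1213, 76_000)` comes to roughly 16 tokens/s. Because the reported time was more than 76 s, that figure is an upper bound, and it approximates prefill speed only if time to first token is dominated by prefill. The first-token budget is separate from the inter-token budget so that a slow but progressing prefill is not mislabeled as a stall.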