WebGPU feature detection was not enough to run small LLMs on phones

15 hours ago

WebGPU feature detection alone is not enough to run small LLMs on phones: a browser that reports WebGPU support and adequate adapter limits can still fail to complete inference.
Tests across four environments exposed failures: Safari on iPhone reloaded the page during generation, and LINE's in-app browser stalled without completing a run.
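
One way to act on this is to treat feature detection as a precondition only and then run a short end-to-end generation under a time limit. The TypeScript sketch below is a minimal illustration under stated assumptions, not the approach used in the original tests: `probeInference`, `completesWithin`, and the `generate` callback are hypothetical names, and `generate` stands in for whatever runtime call the page actually uses. A timer running inside the page can flag a stall, but it cannot observe a page reload, because the reload discards the script that holds the timer.

```typescript
// Possible outcomes of a check that goes beyond feature detection.
type ProbeResult =
  | { status: "no-webgpu" }
  | { status: "no-adapter" }
  | { status: "stalled"; afterMs: number }
  | { status: "failed"; error: string }
  | { status: "completed"; elapsedMs: number };

// Minimal structural type so the sketch compiles without WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown>;
}

// Resolves to true if `work` settles successfully within `ms`, false if the
// time limit expires first. Rejections from `work` are passed through.
function completesWithin(work: Promise<unknown>, ms: number): Promise<boolean> {
  return new Promise<boolean>((resolve, reject) => {
    const timer = setTimeout(() => resolve(false), ms);
    work.then(
      () => { clearTimeout(timer); resolve(true); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// `generate` is a hypothetical callback that runs one short generation.
async function probeInference(
  gpu: GpuLike | undefined,
  generate: () => Promise<unknown>,
  timeoutMs: number,
): Promise<ProbeResult> {
  // Step 1: plain feature detection. Passing this proves very little.
  if (!gpu) return { status: "no-webgpu" };
  const adapter = await gpu.requestAdapter();
  if (!adapter) return { status: "no-adapter" };

  // Step 2: an actual run, bounded by a time limit to catch stalls.
  const start = Date.now();
  try {
    const finished = await completesWithin(generate(), timeoutMs);
    return finished
      ? { status: "completed", elapsedMs: Date.now() - start }
      : { status: "stalled", afterMs: timeoutMs };
  } catch (err) {
    return { status: "failed", error: String(err) };
  }
}
```

In a browser the first argument would typically be `navigator.gpu`. Only a `completed` result says the device can finish a run; every other status is a reason to fall back or to warn the user.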
Performance varied significantly: on a Windows desktop, WebLLM decoded tokens twice as fast as wllama despite identical WebGPU support.
In Chrome on a Pixel 8a, a long prompt (1,213 tokens) took more than 76 seconds to produce its first token, versus about 4 seconds for a short prompt, which shows how strongly prompt length affects time to first token.
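
The two figures compared above, time to first token and decode speed, can both be derived from per-token timestamps. The sketch below shows one way to compute them, assuming the page records a timestamp (for example from `performance.now()`) when the request starts and again as each token arrives. `summarizeRun` and `RunMetrics` are hypothetical names, and the numbers in the usage note are illustrative, not measurements from the tests.

```typescript
interface RunMetrics {
  // Delay between submitting the prompt and receiving the first token.
  timeToFirstTokenMs: number;
  // Tokens per second after the first token; null if fewer than two tokens.
  decodeTokensPerSecond: number | null;
}

// requestStartMs: when the prompt was submitted.
// tokenTimesMs: arrival time of each generated token, in order.
// Returns null when the run produced no tokens at all (a stall or a crash).
function summarizeRun(requestStartMs: number, tokenTimesMs: number[]): RunMetrics | null {
  if (tokenTimesMs.length === 0) return null;

  const first = tokenTimesMs[0];
  const last = tokenTimesMs[tokenTimesMs.length - 1];
  const decodedTokens = tokenTimesMs.length - 1; // tokens after the first one
  const decodeSpanMs = last - first;

  return {
    timeToFirstTokenMs: first - requestStartMs,
    decodeTokensPerSecond:
      decodedTokens > 0 && decodeSpanMs > 0
        ? (decodedTokens * 1000) / decodeSpanMs
        : null,
  };
}
```

For example, `summarizeRun(0, [4000, 4100, 4200, 4300, 4400])` returns a time to first token of 4000 ms and a decode rate of 10 tokens per second. Reporting the two numbers separately keeps a slow prefill on a long prompt from being mistaken for slow decoding.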
