WebGPU feature detection was not enough to run small LLMs on phones

13 hours ago
  • #Browser Performance
  • #LLM Inference
  • #WebGPU
  • WebGPU feature detection alone does not show that a phone can run a small LLM: the limits an adapter reports do not guarantee that inference will run to completion.
  • Testing across four environments revealed failures: Safari on iPhone reloaded the page during generation, and LINE's in-app browser stalled and never completed a run.
  • Performance varied significantly: on a Windows desktop, WebLLM decoded tokens twice as fast as wllama despite identical WebGPU support.
  • On a Pixel 8a in Chrome, a long prompt (1213 tokens) took more than 76 seconds to produce its first token, versus about 4 seconds for a short prompt, which shows how strongly context length affects latency.
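
The points above suggest judging a device by an observed generation run rather than by feature detection alone. The following is a minimal sketch of that idea, not the method used in the original tests: the `RunLog` shape, the `summarizeRun` function, the verdict names, and the 10-second first-token threshold are all hypothetical and do not come from WebLLM or wllama. It takes a record of one attempt and derives time to first token, decode speed, and a coarse verdict.

```typescript
// Hypothetical record of one generation attempt. Field names are
// illustrative assumptions, not part of WebLLM or wllama.
interface RunLog {
  hasWebGPU: boolean;      // outcome of feature detection (an adapter was returned)
  startMs: number;         // time the prompt was submitted
  tokenTimesMs: number[];  // arrival time of each generated token
  finished: boolean;       // false if the run stalled or the page reloaded
}

type Verdict = "no-webgpu" | "did-not-finish" | "slow-first-token" | "ok";

interface RunSummary {
  verdict: Verdict;
  ttftMs: number | null;             // time to first token
  decodeTokensPerSec: number | null; // decode speed after the first token
}

// maxTtftMs is an assumed usability threshold, not a measured value.
function summarizeRun(log: RunLog, maxTtftMs: number = 10000): RunSummary {
  if (!log.hasWebGPU) {
    return { verdict: "no-webgpu", ttftMs: null, decodeTokensPerSec: null };
  }

  const times = log.tokenTimesMs;
  const first = times[0];
  const last = times[times.length - 1];

  // Time to first token: gap between submitting the prompt and token 1.
  const ttftMs = first !== undefined ? first - log.startMs : null;

  // Decode speed: tokens after the first, divided by the time they took.
  let decodeTokensPerSec: number | null = null;
  if (first !== undefined && last !== undefined && times.length >= 2) {
    const spanMs = last - first;
    if (spanMs > 0) {
      decodeTokensPerSec = ((times.length - 1) * 1000) / spanMs;
    }
  }

  // Feature detection passed, so the verdict depends on what the run did.
  let verdict: Verdict = "ok";
  if (!log.finished || ttftMs === null) {
    verdict = "did-not-finish";
  } else if (ttftMs > maxTtftMs) {
    verdict = "slow-first-token";
  }

  return { verdict, ttftMs, decodeTokensPerSec };
}
```

A caller would fill `RunLog` from whatever token callback its inference library exposes. Because a page reload destroys in-memory state, detecting that case would also require persisting an "in progress" marker before starting the run, which this sketch leaves out.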