Running Google Gemma 4 Locally with LM Studio's New Headless CLI and Claude Code
- #gemma4
- #llm
- #local-ai
- LM Studio 0.4.0 introduced llmster and a CLI for headless local model inference, so a server can run with no GUI attached.
- Google Gemma 4 26B-A4B is a mixture-of-experts model: 26B total parameters with roughly 4B active per token (the "A4B" in the name), giving strong quality at much lower compute cost than a dense model of the same size.
- On a MacBook Pro with an M4 Pro chip and 48 GB of unified memory, Gemma 4 generates about 51 tokens/second and supports context windows up to 256K tokens.
- The article walks through installing the LM Studio CLI, downloading models, and configuring inference settings (a command-line sketch follows this list).
- It explains how to estimate memory usage, choose a context length, and serve parallel requests (a back-of-the-envelope estimate appears below).
- Gemma 4 can back Claude Code through an Anthropic-compatible endpoint, turning it into a fully local coding assistant (see the endpoint smoke test below).
- Key benefits are privacy, zero per-token cost, and offline use; the noted trade-offs are slower generation and memory constraints.
- The setup routes Claude Code to the local LM Studio server via a handful of environment variables (sketched at the end of this post).
- Future plans include testing other local models like Qwen 3.5 and GLM 4.7 Flash.
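
The article's exact llmster commands aren't reproduced here, so the sketch below uses the `lms` CLI names that ship with current LM Studio releases (`lms bootstrap`, `lms get`, `lms load`, `lms server start`); the model identifier `google/gemma-4-26b-a4b` is a guess at the catalog name, not a confirmed string.

```bash
# Put the bundled lms CLI on PATH (LM Studio installs it under ~/.lmstudio/bin)
~/.lmstudio/bin/lms bootstrap

# Download the model (identifier is hypothetical; check the actual catalog name)
lms get google/gemma-4-26b-a4b

# Load it with an explicit context length and full GPU offload
lms load google/gemma-4-26b-a4b --context-length 65536 --gpu max

# Start the local server headlessly on the default port
lms server start --port 1234
```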
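To make the memory-estimation advice concrete, here is a rough calculation of weights plus KV cache. The bytes-per-parameter figure assumes ~4-bit quantization, and the layer/head numbers are illustrative placeholders, not Gemma 4's published architecture.

```bash
#!/usr/bin/env bash
# Back-of-the-envelope memory estimate. All architecture numbers below are
# placeholders for illustration, not Gemma 4's actual specs.
awk 'BEGIN {
  params_b = 26      # total parameters, billions
  bpp      = 0.55    # bytes/param at ~4-bit quantization (incl. overhead)
  layers   = 48      # hypothetical layer count
  kv_heads = 8       # hypothetical KV head count
  head_dim = 128     # hypothetical head dimension
  ctx      = 65536   # requested context length in tokens

  weights_gb = params_b * bpp
  # KV cache: 2 tensors (K and V) * 2 bytes (fp16) per element,
  # per layer, per KV head, per head dim, per token
  kv_gb = 2 * 2 * layers * kv_heads * head_dim * ctx / 1e9

  printf "weights : ~%.1f GB\n", weights_gb
  printf "KV cache: ~%.1f GB at %d-token context\n", kv_gb, ctx
  printf "total   : ~%.1f GB (plus runtime overhead)\n", weights_gb + kv_gb
}'
```

With these placeholder numbers the total lands around 27 GB, which is why a 48 GB machine can hold the model but long contexts eat quickly into the remaining headroom.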
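Per the article, LM Studio exposes an Anthropic-compatible endpoint. The smoke test below assumes it lives at `/v1/messages` on the default port 1234 and that the local server accepts any placeholder API key; the request shape follows Anthropic's Messages API, and the model name is the same hypothetical identifier as above.

```bash
# Smoke-test the Anthropic-compatible Messages endpoint.
# The dummy key and model name are assumptions; adjust to match
# what `lms server start` actually reports.
curl -s http://localhost:1234/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: lm-studio" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "google/gemma-4-26b-a4b",
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```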
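Claude Code reads its endpoint and model from environment variables. `ANTHROPIC_BASE_URL`, `ANTHROPIC_AUTH_TOKEN`, `ANTHROPIC_MODEL`, and `ANTHROPIC_SMALL_FAST_MODEL` are documented Claude Code settings; the URL and model identifiers below are assumptions matching the earlier sketches.

```bash
# Route Claude Code to the local LM Studio server instead of Anthropic's API.
export ANTHROPIC_BASE_URL="http://localhost:1234"   # local server from `lms server start`
export ANTHROPIC_AUTH_TOKEN="lm-studio"             # dummy token; the local server ignores it
export ANTHROPIC_MODEL="google/gemma-4-26b-a4b"     # hypothetical identifier
export ANTHROPIC_SMALL_FAST_MODEL="google/gemma-4-26b-a4b"

claude   # launch Claude Code; it now talks to the local endpoint
```

Setting the small/fast model to the same local identifier keeps background tasks from silently calling out to the hosted API.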