Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM

22 days ago

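Qwen 3.6-35B-A3B is a 35-billion-parameter Mixture of Experts model that activates only about 3 billion parameters per token. That sparsity makes it runnable on a Mac Mini M4 with 16GB RAM when llama.cpp memory-maps the model file (mmap), so weights are paged in from disk on demand rather than loaded into RAM all at once.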
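A rough back-of-the-envelope sketch of why memory mapping matters here. The 4.5 bits-per-weight figure is an assumption standing in for a typical 4-bit-class quantization; it is not taken from the article, and real GGUF file sizes will differ. The calculation also ignores the KV cache and any layers that are shared across all tokens.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a given quantization."""
    return params_billion * bits_per_weight / 8


# Assumption: roughly 4.5 bits per weight, typical of a 4-bit-class quant.
BITS_PER_WEIGHT = 4.5

total_gb = weights_gb(35, BITS_PER_WEIGHT)   # every parameter in the file
active_gb = weights_gb(3, BITS_PER_WEIGHT)   # parameters touched per token
active_fraction = 3 / 35

print(f"all weights:      ~{total_gb:.1f} GB")
print(f"active per token: ~{active_gb:.1f} GB ({active_fraction:.0%} of parameters)")
```

Under that assumption the full set of weights is larger than 16GB, while the portion used for any single token is far smaller, which is consistent with the model being usable through mmap even though it cannot sit entirely in RAM.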
On a 16GB Mac Mini M4, the model decodes at around 17 tokens/second with zero swap usage and about 81% of memory reported free, which is fast enough for interactive tasks such as chat and code generation.
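To put the decoding speed in practical terms, a small sketch that converts it into wait time per reply. It uses the ~17 tokens/second figure from above and deliberately ignores prompt processing time, so actual waits would be somewhat longer. The function name is illustrative only.

```python
def reply_seconds(n_tokens: int, tokens_per_second: float = 17.0) -> float:
    """Time to decode n_tokens at a steady rate (prompt processing excluded)."""
    return n_tokens / tokens_per_second


for n in (100, 500, 2000):
    print(f"{n:>5} tokens: ~{reply_seconds(n):.0f} s")
```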
Several tools can run the model locally: llama.cpp with mmap is the most reliable, Ollama is the easiest to set up, LM Studio provides a GUI and an MLX-optimized path that works on 16GB, and raw MLX gives the fastest inference but lacks tool calling.
The MLX backend in Ollama 0.19 requires 32GB or more of memory to reach its higher speeds (~112 tok/s); on 16GB, Ollama falls back to the llama.cpp backend. LM Studio's MLX engine, by contrast, can run on 16GB, with lower memory usage and higher speeds.
The author's daily setup uses Ollama as a background API, LM Studio for faster interactive chat, and llama.cpp for scripted, fine-grained control. Links are provided to resources such as GGUF quantizations and benchmarks.
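As an illustration of the "background API" role, here is a minimal sketch of calling a locally running Ollama server from Python using only the standard library. It assumes Ollama is listening on its default port (11434). The model tag is a hypothetical placeholder, not one confirmed by the article; substitute whatever name `ollama list` reports on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "qwen3.6:35b-a3b"  # hypothetical tag; replace with your installed model


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG) -> str:
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Write a one-line Python hello world.")` would return the model's reply as a string. Building the request separately from sending it keeps the payload easy to inspect without a live server.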
