Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM
a day ago
- #Qwen 3.6
- #Mac Mini M4
- #local AI inference
- Qwen 3.6-35B-A3B is a 35-billion-parameter Mixture-of-Experts model that activates only about 3 billion parameters per token, which makes it runnable on a Mac Mini M4 with 16GB RAM when llama.cpp loads the weights through memory mapping (mmap).
- On a 16GB Mac Mini M4, the model decodes at around 17 tokens/second with zero swap usage and about 81% of memory free, which is fast enough for interactive tasks such as chat and code generation.
- Several tools can run the model locally: llama.cpp with mmap is the most reliable, Ollama offers the easiest setup, LM Studio provides a GUI and MLX optimization on 16GB, and raw MLX gives the fastest inference but lacks tool calling.
- The MLX backend in Ollama 0.19 needs 32GB or more of memory to reach its higher speeds (~112 tok/s); on 16GB, Ollama defaults to the llama.cpp backend. LM Studio's MLX backend can run on 16GB, with lower memory usage and faster speeds.
- The author's daily setup uses Ollama as a background API, LM Studio for faster interactive chat, and llama.cpp for scripted control. Links are provided to resources such as GGUF quantizations and benchmarks.
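
The memory and speed figures in the summary can be sanity-checked with some rough arithmetic. The sketch below is only a back-of-the-envelope estimate, not the author's benchmark method: the 35B and 3B parameter counts and the 17 tokens/second rate come from the summary, while the 4.5 bits per weight is an assumed value standing in for a typical 4-bit GGUF quantization, and the helper names `weights_gib` and `seconds_for` are hypothetical.

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB (no KV cache, no runtime overhead)."""
    return n_params * bits_per_weight / 8 / 2**30


def seconds_for(tokens: int, tok_per_s: float = 17.0) -> float:
    """Time to decode `tokens` tokens at a steady decoding rate."""
    return tokens / tok_per_s


TOTAL_PARAMS = 35e9    # all experts, from the model name (35B)
ACTIVE_PARAMS = 3e9    # parameters activated per token (A3B)
BITS_PER_WEIGHT = 4.5  # ASSUMPTION: roughly a 4-bit quantization, not a figure from the post

total = weights_gib(TOTAL_PARAMS, BITS_PER_WEIGHT)
active = weights_gib(ACTIVE_PARAMS, BITS_PER_WEIGHT)

print(f"full weight file:        {total:.1f} GiB")
print(f"weights read per token:  {active:.1f} GiB")
print(f"500-token reply at 17 tok/s: {seconds_for(500):.0f} s")
```

Under that assumption the full weight file is larger than 16GB of RAM, which is why mmap matters: the operating system pages weights in from disk on demand and can drop them again without touching swap. The per-token figure is a lower bound on what must be resident, since different tokens route to different experts and the set of pages touched grows over a conversation.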