Running Qwen 3.6 Locally on a Mac Mini M4 with 16GB RAM

a day ago
  • #Qwen 3.6
  • #Mac Mini M4
  • #local AI inference
  • Qwen 3.6-35B-A3B is a 35-billion-parameter Mixture-of-Experts model that activates only about 3 billion parameters per token. That makes it runnable on a Mac Mini M4 with 16GB RAM when llama.cpp memory-maps (mmap) the weights, so the whole weight file does not have to be resident in RAM at once.
  • On a 16GB Mac Mini M4, the model decodes at around 17 tokens/second with zero swap usage and about 81% of memory free, which is fast enough for interactive tasks such as chat and code generation.
  • Several tools can run the model locally: llama.cpp with mmap is the most reliable, Ollama offers the easiest setup, LM Studio provides a GUI and MLX optimization that works on 16GB, and raw MLX gives the fastest inference but lacks tool calling.
  • The MLX backend in Ollama 0.19 needs 32GB or more of memory to reach its higher speeds (~112 tok/s); on 16GB, Ollama falls back to the llama.cpp backend. LM Studio's MLX backend can run on 16GB, with lower memory usage and higher speeds.
  • The author's daily setup uses Ollama as a background API, LM Studio for faster interactive chat, and llama.cpp for scripted control. The article also links to resources such as GGUF quantizations and benchmarks.
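
The memory claim in the summary can be checked with rough arithmetic. The sketch below is illustrative only: the 35 billion and 3 billion parameter counts come from the summary, but the 4-bit quantization width is an assumption (real GGUF quantizations usually average somewhat more than 4 bits per weight), and the KV cache and runtime overhead are ignored.

```python
# Back-of-envelope weight sizes. The bit width is an assumption,
# not a measurement taken from the article.

GB = 1e9  # decimal gigabytes


def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone (no KV cache, no overhead)."""
    return num_params * bits_per_weight / 8


TOTAL_PARAMS = 35e9   # from the summary: 35 billion parameters in total
ACTIVE_PARAMS = 3e9   # from the summary: about 3 billion active per token
BITS = 4.0            # assumed quantization width (hypothetical)

total_gb = weight_bytes(TOTAL_PARAMS, BITS) / GB
active_gb = weight_bytes(ACTIVE_PARAMS, BITS) / GB
ram_gb = 16 * 2**30 / GB  # 16 GiB of physical RAM, expressed in decimal GB

print(f"all weights:      {total_gb:.1f} GB")
print(f"active per token: {active_gb:.1f} GB")
print(f"physical RAM:     {ram_gb:.1f} GB")
print("fits in RAM" if total_gb < ram_gb else "does not fit in RAM all at once")
```

Under these assumptions the full weight set is slightly larger than physical RAM, while the portion touched for any single token is a small fraction of it. That gap is what makes memory mapping a plausible fit for a Mixture-of-Experts model.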
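
To make the mmap idea from the summary concrete, here is a minimal sketch using Python's standard `mmap` module on a small temporary file. It is not llama.cpp's loader (which is written in C/C++); the chunk size, chunk count, and the `read_chunk` helper are hypothetical stand-ins for expert weights. The point is only that a mapped file can be sliced without first reading the whole file into memory.

```python
import mmap
import os
import tempfile

CHUNK = 4096      # stand-in for one expert's weights (hypothetical size)
NUM_CHUNKS = 64   # stand-in for the number of experts (hypothetical)

# Write a small file in which chunk i is filled with the byte value i.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(NUM_CHUNKS):
        f.write(bytes([i]) * CHUNK)
    path = f.name


def read_chunk(mapped: mmap.mmap, index: int) -> bytes:
    """Return one chunk; only the pages backing this slice must be resident."""
    start = index * CHUNK
    return mapped[start:start + CHUNK]


with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        mapped_size = len(mapped)      # full file is addressable...
        chunk = read_chunk(mapped, 7)  # ...but only one slice is touched here

os.remove(path)
print(mapped_size, chunk[:4])
```

The operating system pages data in on demand and can evict clean pages under memory pressure, which is the general mechanism the summary credits for running the model on 16GB.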