Unsloth GLM-5.2 – How to Run Locally


GLM-5.2 is a new open model from Z.ai with 744B parameters, 40B active parameters, and a 1M context window.
It can be run locally using Unsloth Dynamic GGUFs, which reduce the model size from 1.51TB to 239GB (2-bit) or 217GB (1-bit).
Hardware requirements vary by quantization level: the 1-bit quant needs 223GB of total memory, and the 8-bit quant needs 810GB.
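
The size and memory figures above can be turned into a quick feasibility check. The following is a minimal sketch, not part of the original guide: it assumes "total memory" means RAM plus VRAM combined and that 1TB is 1000GB, and the function names `size_reduction` and `fits` are hypothetical helpers introduced for illustration.

```python
# Figures quoted in the summary above, in GB.
FULL_SIZE_GB = 1510  # 1.51TB, assuming 1TB = 1000GB
DISK_SIZE_GB = {"1-bit": 217, "2-bit": 239}
REQUIRED_MEMORY_GB = {"1-bit": 223, "8-bit": 810}


def size_reduction(quant: str) -> float:
    """Fraction of the full model size saved by the quantized file."""
    return 1 - DISK_SIZE_GB[quant] / FULL_SIZE_GB


def fits(quant: str, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """True if combined RAM and VRAM meet the quoted requirement.

    Assumption: "total memory" in the summary means RAM + VRAM.
    """
    return ram_gb + vram_gb >= REQUIRED_MEMORY_GB[quant]


if __name__ == "__main__":
    for quant in DISK_SIZE_GB:
        print(f"{quant}: {size_reduction(quant):.1%} smaller than the full model")
    print("1-bit fits in 192GB RAM + 48GB VRAM:", fits("1-bit", 192, 48))
```

Under these assumptions, both dynamic quants come out at roughly one seventh of the full model size, and a machine with 192GB of RAM and 48GB of VRAM clears the 1-bit requirement with a small margin.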
The model offers three thinking modes: Non-thinking, High, and Max, with Max recommended for complex tasks.
Default settings for most tasks include a temperature of 1.0 and top_p of 0.95.
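
To make the two default settings concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-p (nucleus) filtering. It illustrates the general technique only and is not the sampler used by Unsloth or llama.cpp; the function name `sample_top_p` is hypothetical.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.95, rng=None):
    """Sample a token index using temperature scaling and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With a temperature of 1.0 the logits are left unscaled, and a top_p of 0.95 discards only the least likely tail of the distribution, so sampling stays diverse while avoiding very improbable tokens.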
Quantization analysis shows dynamic 1-bit achieving 76.2% accuracy and dynamic 2-bit around 82%, while dynamic 4-bit and 5-bit are near-lossless.
GLM-5.2 can be run in Unsloth Studio for a web UI experience or via llama.cpp for command-line inference.
KV cache quantization reduces the memory used per token of context, which makes longer context lengths feasible on the same hardware.
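
The effect of KV cache quantization on memory can be estimated with simple arithmetic. The sketch below assumes a standard attention layout in which every layer stores full keys and values; GLM-5.2 may use a different cache design, so the formula is only an approximation. The layer count, head count, and head dimension in the example are hypothetical placeholders, not GLM-5.2's real configuration. The bits-per-element values are the block sizes of llama.cpp's `f16`, `q8_0`, and `q4_0` types.

```python
# Approximate storage cost per cached element, in bits.
# q8_0 and q4_0 store a 16-bit scale per block of 32 values.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, cache_type="f16"):
    """Estimate KV cache size in GiB for a standard attention layout."""
    # Factor of 2 covers keys and values.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_len
    return elements * BITS_PER_ELEMENT[cache_type] / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to illustrate the scaling.
    dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, context_len=131072)
    for cache_type in BITS_PER_ELEMENT:
        print(f"{cache_type}: {kv_cache_gib(**dims, cache_type=cache_type):.1f} GiB")
```

For these placeholder dimensions, moving from `f16` to `q8_0` roughly halves the cache and `q4_0` brings it to a little over a quarter, which is where the room for longer contexts comes from. In llama.cpp the cache types are selected with the `--cache-type-k` and `--cache-type-v` options.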
Benchmarks indicate GLM-5.2 performs on par with top models like Claude 4.8 Opus and GPT-5.5.
