Unsloth GLM-5.2 – How to Run Locally
- #Local Inference
- #AI Model
- #Quantization
- GLM-5.2 is a new open model from Z.ai with 744B total parameters, 40B active parameters, and a 1M-token context window.
- It can be run locally using Unsloth Dynamic GGUFs, which reduce the model size from 1.51TB to 239GB (2-bit) or 217GB (1-bit).
- Hardware requirements depend on the quantization level: the 1-bit quant needs 223GB of total memory, while the 8-bit quant needs 810GB.
- The model offers three thinking modes: Non-thinking, High, and Max, with Max recommended for complex tasks.
- The default sampling settings for most tasks are a temperature of 1.0 and a top_p of 0.95.
- Quantization analysis shows dynamic 1-bit achieving 76.2% accuracy and dynamic 2-bit around 82%, while dynamic 4-bit and 5-bit are near-lossless.
- GLM-5.2 can be run in Unsloth Studio for a web UI experience or via llama.cpp for command-line inference.
- Long-context use is supported through KV cache quantization, which lowers the memory needed per cached token so that longer contexts fit in the same memory budget.
- Benchmarks indicate GLM-5.2 performs on par with top models like Claude 4.8 Opus and GPT-5.5.
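
The size and memory figures in the summary can be sanity-checked with a little arithmetic. The sketch below uses only numbers quoted above and assumes decimal units (1 TB = 1000 GB, 1 GB = 10^9 bytes); the helper names `bits_per_weight`, `compression_ratio`, and `fits` are hypothetical, and the per-weight figure is a rough average because a GGUF file also holds metadata.

```python
# Figures quoted in the summary; decimal units are an assumption.
PARAMS = 744e9        # total parameters
FULL_SIZE_GB = 1510   # 1.51TB unquantized

QUANT_SIZE_GB = {"1-bit": 217, "2-bit": 239}
REQUIRED_MEMORY_GB = {"1-bit": 223, "8-bit": 810}


def bits_per_weight(size_gb, params=PARAMS):
    """Average bits stored per parameter for a file of the given size."""
    return size_gb * 1e9 * 8 / params


def compression_ratio(size_gb, full_size_gb=FULL_SIZE_GB):
    """How many times smaller the quantized file is than the full model."""
    return full_size_gb / size_gb


def fits(total_memory_gb, quant):
    """True if the machine meets the quoted total-memory requirement."""
    return total_memory_gb >= REQUIRED_MEMORY_GB[quant]


for name, size in QUANT_SIZE_GB.items():
    print(
        f"{name}: {compression_ratio(size):.2f}x smaller, "
        f"about {bits_per_weight(size):.2f} bits per weight on average"
    )

# Example machine with 256 GB of total memory (a made-up configuration).
print("256 GB, 1-bit fits:", fits(256, "1-bit"))
print("256 GB, 8-bit fits:", fits(256, "8-bit"))
```

Under these assumptions the "1-bit" file averages a little over 2 bits per weight, which is consistent with a dynamic scheme that keeps some tensors at higher precision than the nominal bit width.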
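
To make the sampling settings concrete, here is a minimal sketch of what temperature and top_p do to a next-token distribution. It is a generic illustration of temperature scaling followed by nucleus (top-p) sampling, not the code used by Unsloth Studio or llama.cpp; the function names and the toy logits are hypothetical.

```python
import math
import random


def top_p_filter(probs, top_p):
    """Indices of the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def sample(logits, temperature=1.0, top_p=0.95, rng=random):
    """Draw one token index using temperature scaling and top-p filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    kept = top_p_filter(probs, top_p)
    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Toy logits for a four-token vocabulary (illustrative values only).
print(sample([2.0, 1.0, 0.5, -3.0]))
```

A temperature of 1.0 leaves the model's distribution unchanged, and a top_p of 0.95 only trims the least likely tail, so these defaults stay close to the raw model output.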
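
The effect of KV cache quantization on context length can be estimated with a back-of-the-envelope calculation. The sketch below assumes a standard grouped-query attention layout, where each token stores one key and one value vector per layer and KV head; GLM-5.2's actual attention layout is not given in this summary and may differ, so the layer and head counts are hypothetical placeholders. The bytes-per-element values follow llama.cpp's `f16`, `q8_0`, and `q4_0` formats, which llama.cpp selects with the `--cache-type-k` and `--cache-type-v` options.

```python
# Bytes per cached element. q8_0 and q4_0 pack 32 values per block
# together with one 2-byte scale (34 and 18 bytes per block).
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,
    "q4_0": 18 / 32,
}


def kv_cache_gb(n_tokens, n_layers, n_kv_heads, head_dim, cache_type):
    """Approximate KV cache size in GB for a grouped-query attention model."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_tokens  # 2 = K and V
    return elements * BYTES_PER_ELEMENT[cache_type] / 1e9


# Hypothetical dimensions, NOT the real GLM-5.2 configuration.
HYPOTHETICAL = dict(n_layers=64, n_kv_heads=8, head_dim=128)

for cache_type in BYTES_PER_ELEMENT:
    gb = kv_cache_gb(131072, cache_type=cache_type, **HYPOTHETICAL)
    print(f"{cache_type}: {gb:.1f} GB for 131072 tokens")
```

Whatever the real dimensions are, the ratios hold for this layout: moving from `f16` to `q8_0` shrinks the cache by about 1.9x and `q4_0` by about 3.6x, so the same memory holds proportionally more tokens.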