Hasty Briefs (beta)

Unsloth GLM-5.2 – How to Run Locally

5 hours ago
  • #Local Inference
  • #AI Model
  • #Quantization
  • GLM-5.2 is a new open model from Z.ai with 744B total parameters (40B active) and a 1M-token context window.
  • It can be run locally using Unsloth Dynamic GGUFs, which reduce the model size from 1.51TB to 239GB (2-bit) or 217GB (1-bit).
  • Memory requirements scale with the quantization level: the 1-bit quant needs 223GB of total memory, while the 8-bit quant needs 810GB.
  • The model offers three thinking modes: Non-thinking, High, and Max, with Max recommended for complex tasks.
  • Default settings for most tasks include a temperature of 1.0 and top_p of 0.95.
  • Quantization analysis shows dynamic 1-bit achieving 76.2% accuracy and dynamic 2-bit around 82%, while dynamic 4-bit and 5-bit are near-lossless.
  • GLM-5.2 can be run in Unsloth Studio for a web UI experience or via llama.cpp for command-line inference.
  • KV cache quantization shrinks the memory the cache occupies, which makes longer context lengths feasible on the same hardware.
  • Benchmarks indicate GLM-5.2 performs on par with top models like Claude 4.8 Opus and GPT-5.5.
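
To make the size and settings figures above more concrete, here is a minimal Python sketch. It does two things: it converts the quoted file sizes into average bits per weight, and it assembles a llama.cpp `llama-cli` argument list using the quoted sampling defaults plus a quantized K cache. Several details are assumptions rather than anything stated in the summary: decimal units (1 GB = 1e9 bytes), the model filename (hypothetical), the 16384-token context size, and the `q8_0` cache type. Treat it as an illustration, not as the article's own commands.

```python
# Figures quoted in the summary above. Decimal units (1 GB = 1e9 bytes)
# are an assumption; the summary does not say which convention it uses.
TOTAL_PARAMS = 744e9
SIZES_GB = {"full": 1510, "dynamic 2-bit": 239, "dynamic 1-bit": 217}


def bits_per_weight(size_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average number of bits stored per parameter for a file of this size."""
    return size_gb * 1e9 * 8 / params


def llama_cli_args(model_path: str, ctx_size: int = 16384) -> list[str]:
    """Build a llama-cli argument list with the quoted sampling defaults.

    ctx_size and the q8_0 cache type are illustrative assumptions.
    """
    return [
        "llama-cli",
        "-m", model_path,
        "--temp", "1.0",            # default temperature from the summary
        "--top-p", "0.95",          # default top_p from the summary
        "--ctx-size", str(ctx_size),
        "--cache-type-k", "q8_0",   # store the K cache in 8-bit form
    ]


if __name__ == "__main__":
    for name, gb in SIZES_GB.items():
        ratio = SIZES_GB["full"] / gb
        print(f"{name}: {bits_per_weight(gb):.2f} bits/weight, "
              f"{ratio:.2f}x smaller than full")
    # Hypothetical filename; substitute the GGUF file actually downloaded.
    print(" ".join(llama_cli_args("GLM-5.2-dynamic-2bit.gguf")))
```

Under these assumptions the "2-bit" file averages about 2.57 bits per weight and the "1-bit" file about 2.33, which would be consistent with a dynamic scheme that keeps some layers at higher precision than the nominal bit width. llama.cpp also exposes `--cache-type-v` for the V cache; quantizing it generally requires flash attention to be enabled, so it is left out of this sketch.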