GLM-4.5V: An open-source multimodal large language model from Zhipu AI
- #Open Source
- #Multimodal AI
- #Vision-Language Models
- GLM-4.5V and GLM-4.1V series models are open-sourced, enhancing vision-language model (VLM) reasoning capabilities.
- GLM-4.5V offers significant improvements across multiple benchmarks and includes a desktop assistant app for debugging.
- GLM-4.1V-9B-Thinking introduces a reasoning paradigm and Reinforcement Learning with Curriculum Sampling (RLCS), outperforming much larger models on 18 benchmark tasks.
- Both models support multimodal preprocessing but use different conversation templates.
- Installation and inference steps are provided for NVIDIA GPUs, with serving options via SGLang and vLLM (see the inference sketch after this list).
- Fine-tuning is supported via LLaMA-Factory, with dataset construction examples provided (a dataset sketch follows this list).
- GLM-4.5V focuses on real-world usability, handling diverse visual content types and introducing a Thinking Mode switch (a toggle sketch also follows this list).
- Known issues include frontend code reproduction errors, overthinking, and occasional answer restatement.
- Citations and technical details are provided for academic use.
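Once the model is served with SGLang or vLLM, it can be queried through their OpenAI-compatible API. The sketch below is a minimal example under assumed defaults: the server URL, the `zai-org/GLM-4.5V` model identifier, and the image URL are placeholders to adjust to your own deployment.

```python
# Minimal sketch: querying a GLM-4.5V endpoint served by vLLM or SGLang
# through the OpenAI-compatible chat completions API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",  # assumed model identifier; match your serve command
    messages=[
        {
            "role": "user",
            "content": [
                # Image plus text in one turn, following the OpenAI multimodal message format.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize the trend shown in this chart."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```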
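For fine-tuning with LLaMA-Factory, training data is typically prepared as a sharegpt-style multimodal JSON file with an `images` field and an `<image>` placeholder in the user turn. The file name and sample content below are illustrative assumptions; the resulting dataset still needs to be registered in `dataset_info.json` before launching a run.

```python
# Minimal sketch of building a LLaMA-Factory style multimodal SFT dataset file.
import json

samples = [
    {
        "messages": [
            # <image> marks where the picture listed under "images" is injected.
            {"role": "user", "content": "<image>What product defect is visible here?"},
            {"role": "assistant", "content": "The casing shows a hairline crack along the left edge."},
        ],
        "images": ["data/images/defect_001.jpg"],  # hypothetical local path
    }
]

with open("glm45v_sft_demo.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```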
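The Thinking Mode switch lets callers trade reasoning depth for latency per request. The sketch below assumes the serving stack exposes the toggle as a chat-template kwarg named `enable_thinking`; the exact parameter name and mechanism depend on the server, so check the model card and serving docs for the supported switch.

```python
# Hypothetical sketch of disabling Thinking Mode for a single request
# against an OpenAI-compatible GLM-4.5V endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{"role": "user", "content": "Give a one-line answer: what is 17 * 23?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumed toggle name
    max_tokens=64,
)
print(response.choices[0].message.content)
```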