GLM-4.5V: An open-source multimodal large language model from Zhipu AI
- #Open Source
- #Multimodal AI
- #Vision-Language Models
- GLM-4.5V and GLM-4.1V series models are open-sourced, enhancing vision-language model (VLM) reasoning capabilities.
- GLM-4.5V offers significant improvements across multiple benchmarks and includes a desktop assistant app for debugging.
- GLM-4.1V-9B-Thinking introduces a reasoning paradigm and Reinforcement Learning with Curriculum Sampling (RLCS), outperforming much larger models on 18 benchmark tasks.
- Both models support multimodal preprocessing but use different conversation templates.
- Installation and inference steps are provided for NVIDIA GPUs, with serving options via SGLang and vLLM (see the inference sketch after this list).
- Fine-tuning is supported via LLaMA-Factory, with dataset construction examples provided (a dataset sketch follows this list).
- GLM-4.5V focuses on real-world usability, handling diverse visual content types and introducing a Thinking Mode switch (a toggle sketch also follows this list).
- Known issues include frontend code reproduction errors, overthinking, and occasional answer restatement.
- Citations and technical details are provided for academic use.
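Once the model is served with SGLang or vLLM, it can be queried through their OpenAI-compatible API. The sketch below is a minimal example under assumed defaults: the server URL, the `zai-org/GLM-4.5V` model identifier, and the image URL are placeholders to adjust to your own deployment.

```python
# Minimal sketch: querying a GLM-4.5V endpoint served by vLLM or SGLang
# through the OpenAI-compatible chat completions API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",  # assumed model identifier; match your serve command
    messages=[
        {
            "role": "user",
            "content": [
                # Image plus text in one turn, following the OpenAI multimodal message format.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
                {"type": "text", "text": "Summarize the trend shown in this chart."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```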
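For fine-tuning with LLaMA-Factory, training data is typically prepared as a sharegpt-style multimodal JSON file with an `images` field and an `<image>` placeholder in the user turn. The file name and sample content below are illustrative assumptions; the resulting dataset still needs to be registered in `dataset_info.json` before launching a run.

```python
# Minimal sketch of building a LLaMA-Factory style multimodal SFT dataset file.
import json

samples = [
    {
        "messages": [
            # <image> marks where the picture listed under "images" is injected.
            {"role": "user", "content": "<image>What product defect is visible here?"},
            {"role": "assistant", "content": "The casing shows a hairline crack along the left edge."},
        ],
        "images": ["data/images/defect_001.jpg"],  # hypothetical local path
    }
]

with open("glm45v_sft_demo.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```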
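The Thinking Mode switch lets callers trade reasoning depth for latency per request. The sketch below assumes the serving stack exposes the toggle as a chat-template kwarg named `enable_thinking`; the exact parameter name and mechanism depend on the server, so check the model card and serving docs for the supported switch.

```python
# Hypothetical sketch of disabling Thinking Mode for a single request
# against an OpenAI-compatible GLM-4.5V endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-4.5V",
    messages=[{"role": "user", "content": "Give a one-line answer: what is 17 * 23?"}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},  # assumed toggle name
    max_tokens=64,
)
print(response.choices[0].message.content)
```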