Choosing a GGUF Model: K-Quants, IQ Variants, and Legacy Formats
11 hours ago
- #GGUF format
- #Model quantization
- #LLM inference
- GGUF is a common file format for local LLM inference. It was introduced by the llama.cpp project and is also used by Ollama and other runtimes. Weights are stored in blockwise quantized formats that trade off accuracy, speed, and memory.
- Legacy formats (e.g., Q4_0, Q5_0) use simple linear quantization with one scale per block of 32 weights (the _1 variants such as Q4_1 add a per-block offset). They are fast but lose accuracy at lower bit widths. Q8_0 is near-lossless and works well as an INT8 baseline.
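- K-quants (Q2_K through Q6_K) are the modern defaults for 3–6 bits. They use a two-level scheme: weights are grouped into super-blocks of 256, split into smaller sub-blocks whose scales are themselves quantized, which gives better accuracy per bit. Q4_K_M is popular for its balance of low memory and high accuracy; the _S/_M/_L suffix indicates how many tensors are kept at a higher precision.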
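- I-quants (e.g., IQ2_XS, IQ4_XS) target quality at very low bit widths using codebook-based (non-linear) quantization, usually guided by an importance matrix computed from calibration data. They compress further than K-quants at comparable quality, but results depend on how representative the calibration data is, and decoding can be slower on some hardware, particularly CPUs.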
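- Formats like TQ1_0 use ternary weights for extreme compression (about 1.69 bits/weight), which helps fit very large models in memory when the model was trained with ternary weights (e.g., BitNet-style models). GGUF can also store unquantized tensors (F32, F16, BF16) and mix tensor types within one file, so different tensors can use different precisions.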
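
To make the legacy per-block scheme above concrete, here is a minimal NumPy sketch of symmetric blockwise quantization with one float16 scale per block of 32 weights. It is a simplified illustration, not the bit-exact llama.cpp layout (the real Q4_0 packs two values per byte and derives its scale slightly differently). The function names `quantize_blocks`, `dequantize_blocks`, and `bits_per_weight` are hypothetical helpers made up for this example.

```python
import numpy as np

BLOCK = 32  # weights per block in the legacy formats


def quantize_blocks(w, bits):
    """Symmetric quantization with one float16 scale per block of BLOCK weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    blocks = np.asarray(w, dtype=np.float32).reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    scale = (amax / qmax).astype(np.float16)
    scale[scale == 0] = 1  # all-zero (or underflowing) block: avoid dividing by zero
    q = np.rint(blocks / scale.astype(np.float32))
    q = np.clip(q, -qmax, qmax).astype(np.int8)
    return q, scale


def dequantize_blocks(q, scale):
    """Reconstruct approximate float32 weights from integers and block scales."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)


def bits_per_weight(bits, block=BLOCK, scale_bits=16):
    """Storage cost per weight: payload bits plus the amortized block scale."""
    return bits + scale_bits / block


# Compare a 4-bit and an 8-bit round trip on random weights.
rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)
for bits in (4, 8):
    q, scale = quantize_blocks(weights, bits)
    err = np.sqrt(np.mean((dequantize_blocks(q, scale) - weights) ** 2))
    print(f"{bits}-bit: {bits_per_weight(bits):.2f} bits/weight, RMS error {err:.5f}")
```

With one 16-bit scale per 32 weights, the storage cost works out to 4.5 bits/weight in the 4-bit case and 8.5 bits/weight in the 8-bit case, which matches the nominal sizes of Q4_0 and Q8_0. The much smaller reconstruction error at 8 bits is why Q8_0 is treated as near-lossless.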