Choosing a GGUF Model: K-Quants, IQ Variants, and Legacy Formats


GGUF is a common format for local LLM inference, introduced by llama.cpp and used by Ollama and others. It stores weights in blockwise quantized formats, balancing accuracy, speed, and memory.
Legacy formats (e.g., Q4_0, Q5_0) use simple linear quantization within each block: every block stores one scale, plus an offset in the _1 variants. They are fast but lose accuracy at lower bit widths. Q8_0 is near-lossless and works well as an INT8 baseline.
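To make "linear per-block quantization" concrete, here is a minimal sketch in the spirit of Q8_0: each block of 32 weights gets one scale, and each weight becomes a signed 8-bit code. It is an illustration under simplifying assumptions, not llama.cpp's implementation: the scale stays a Python float rather than being stored as fp16, nothing is bit-packed, and the function names are hypothetical.

```python
import math

BLOCK_SIZE = 32  # weights per block, as in the Q8_0 layout


def quantize_block(block):
    """Return (scale, codes) for one block using a symmetric linear grid."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0.0, [0] * len(block)
    scale = amax / 127.0  # largest magnitude maps to +/-127
    codes = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate weights from one block."""
    return [scale * q for q in codes]


def quantize(weights):
    """Split a flat list of weights into blocks and quantize each one."""
    return [
        quantize_block(weights[start:start + BLOCK_SIZE])
        for start in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Concatenate the reconstructed blocks back into a flat list."""
    out = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out


# Synthetic weights, for illustration only.
weights = [math.sin(0.37 * i) * 0.05 for i in range(64)]
restored = dequantize(quantize(weights))
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max abs error: {worst:.6f}")
```

The reconstruction error of each weight is bounded by half that block's scale, which is why a single outlier in a block degrades every other weight in it. That effect grows as the code width shrinks from 8 bits to 4 or 5.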
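K-quants (e.g., Q2_K, Q4_K) are the usual default in the 3–6 bit range. They use a two-level scheme: weights are grouped into super-blocks, and the per-sub-block scales are themselves quantized, which improves accuracy per bit. Q4_K_M, a mix that keeps some tensors at higher precision, is popular for its balance of low memory use and high accuracy.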
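The two-level idea can be sketched as follows, loosely modeled on the Q4_K layout: a super-block of 256 weights is split into 8 sub-blocks of 32, each sub-block has its own scale and minimum for 4-bit codes, and those scales and minimums are stored as 6-bit integers relative to two super-block-wide floats. This is a simplified illustration. The real quantizer searches for better scales and packs the bits tightly, and the function names here are hypothetical.

```python
SUPER = 256  # weights per super-block
SUB = 32     # weights per sub-block


def quantize_superblock(w):
    """Two-level quantization: 4-bit codes, 6-bit sub-block scales and mins."""
    assert len(w) == SUPER
    subs = [w[i:i + SUB] for i in range(0, SUPER, SUB)]

    # Level 1: a float scale and a non-negative offset per sub-block.
    raw_scales, raw_mins = [], []
    for s in subs:
        lo = min(min(s), 0.0)
        hi = max(s)
        raw_scales.append((hi - lo) / 15.0)
        raw_mins.append(-lo)

    # Level 2: quantize those scales and offsets to 6 bits (0..63).
    d = max(raw_scales) / 63.0
    dmin = max(raw_mins) / 63.0
    sc = [round(x / d) if d > 0 else 0 for x in raw_scales]
    mn = [round(x / dmin) if dmin > 0 else 0 for x in raw_mins]

    # Encode the weights using the scales the decoder will actually see.
    codes = []
    for s, a, b in zip(subs, sc, mn):
        eff_scale, eff_min = d * a, dmin * b
        for x in s:
            q = round((x + eff_min) / eff_scale) if eff_scale > 0 else 0
            codes.append(max(0, min(15, q)))
    return d, dmin, sc, mn, codes


def dequantize_superblock(d, dmin, sc, mn, codes):
    """Reconstruct weights as d * sc * q - dmin * mn per sub-block."""
    out = []
    for j in range(SUPER // SUB):
        for q in codes[j * SUB:(j + 1) * SUB]:
            out.append(d * sc[j] * q - dmin * mn[j])
    return out


def bits_per_weight():
    """Storage cost if d and dmin are kept as 16-bit floats."""
    code_bits = SUPER * 4
    scale_bits = (SUPER // SUB) * (6 + 6)
    header_bits = 2 * 16
    return (code_bits + scale_bits + header_bits) / SUPER
```

Under these assumptions the layout costs 4.5 bits per weight, the same as the one-level Q4_0 layout (16 bytes of codes plus a 2-byte scale per 32 weights), yet every 32-weight group also gets its own offset. That is one way to read the claim of better accuracy per bit.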
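I-quants (e.g., IQ2_XS, IQ4_XS) target quality at very low precision. They encode weights through codebook lookups rather than a plain linear grid, and they are usually quantized with an importance matrix (imatrix) computed from calibration data. They compress further than K-quants at comparable quality, but results depend on the calibration data, and decoding can be slower on some hardware, particularly CPUs.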
Formats like TQ1_0 store ternary weights for extreme compression (about 1.7 bits/weight) and are intended for models trained with ternary weights. GGUF also supports unquantized tensors (F32, F16, BF16) and mixed-precision files in which different tensors use different types.
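For a rough sense of the memory trade-off across these families, the sketch below multiplies a parameter count by a nominal bits-per-weight figure. The figures are assumptions derived from the block layouts as commonly documented (for example 18 bytes per 32 weights for Q4_0) and should be checked against the llama.cpp version in use. Real files mix tensor types and carry metadata, so actual sizes will differ, and the function name is hypothetical.

```python
# Nominal bits per weight, assumed from the per-block byte layouts.
NOMINAL_BPW = {
    "F16": 16.0,
    "Q8_0": 8.5,      # 34 bytes per 32 weights
    "Q6_K": 6.5625,   # 210 bytes per 256 weights
    "Q4_K": 4.5,      # 144 bytes per 256 weights
    "Q4_0": 4.5,      # 18 bytes per 32 weights
    "TQ1_0": 1.6875,  # 54 bytes per 256 weights
}


def estimate_gib(n_params, fmt):
    """Estimate weight storage in GiB if every tensor used a single format."""
    bits = n_params * NOMINAL_BPW[fmt]
    return bits / 8 / 2**30


# Hypothetical 8-billion-parameter model, for illustration only.
for fmt in NOMINAL_BPW:
    print(f"{fmt:>6}: {estimate_gib(8_000_000_000, fmt):6.2f} GiB")
```

The estimate covers weights only. The KV cache and activation buffers add to the runtime footprint and grow with context length.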
