Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change

10 hours ago

Introduces a method that cuts the time to rebuild a quantized local model from 80 minutes to 5 minutes by reusing tensors that did not change.
Highlights the importance of allocating bits efficiently during quantization, especially for Mixture-of-Experts models like DeepSeek, so that precision is not spent where it contributes little.
Describes a tool called forgequant that manages quantization recipes and identifies which experts to prioritize based on an importance matrix derived from actual usage.
Explains the use of an importance matrix to track expert activity, allowing for targeted precision upgrades in layers that are critical to specific tasks.
Mentions the potential for models to self-tune based on live usage data, optimizing both quality and speed for individual workflows and hardware.
Emphasizes that the primary claims are the speedup and the byte-for-byte fidelity of the rebuilt output, not necessarily better model performance.
Discusses future possibilities, such as sharing domain-specific recipes and automating bit allocation searches, leveraging the faster iteration capability.
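
The tensor-reuse idea can be illustrated with a content-addressed cache. This is a minimal sketch under stated assumptions, not forgequant's actual implementation: the function names, the tensor names, and the toy quantizer are all hypothetical. Each tensor is keyed by a hash of its name, its source bytes, and its target quantization type, so a recipe change only re-quantizes the tensors whose key changed.

```python
import hashlib


def fingerprint(name: str, raw: bytes, qtype: str) -> str:
    """Cache key: changes if the tensor's bytes or its target type change."""
    h = hashlib.sha256()
    for part in (name.encode(), qtype.encode(), raw):
        h.update(part)
        h.update(b"\0")  # separator so adjacent fields cannot run together
    return h.hexdigest()


def toy_quantize(raw: bytes, qtype: str) -> bytes:
    """Hypothetical stand-in for an expensive quantization kernel.

    Keeps only the top N bits of each byte, where qtype is "q<N>".
    """
    shift = 8 - int(qtype[1:])
    return bytes((b >> shift) << shift for b in raw)


def requantize(tensors: dict, recipe: dict, cache: dict):
    """Quantize every tensor, reusing cached results where the key matches.

    Returns the quantized tensors and the names that were actually rebuilt.
    """
    out, rebuilt = {}, []
    for name, raw in tensors.items():
        key = fingerprint(name, raw, recipe[name])
        if key not in cache:
            cache[key] = toy_quantize(raw, recipe[name])
            rebuilt.append(name)
        out[name] = cache[key]
    return out, rebuilt


# Hypothetical tensor names and contents, for illustration only.
tensors = {"blk.0.attn": bytes(range(16)), "blk.0.ffn_exp.3": bytes(range(16, 32))}
recipe = {"blk.0.attn": "q4", "blk.0.ffn_exp.3": "q4"}
cache = {}

full, rebuilt_first = requantize(tensors, recipe, cache)   # cold cache: all tensors
recipe["blk.0.ffn_exp.3"] = "q6"                           # change one expert's type
fast, rebuilt_second = requantize(tensors, recipe, cache)  # only that expert reruns
```

Because the cached bytes come from the same deterministic function as a cold run, the fast rebuild matches a from-scratch rebuild byte for byte, which is the property the summary calls out.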

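The targeted precision upgrades can be sketched in the same spirit. The example below is a simplification: it uses per-expert activation counts as a stand-in for a real importance matrix, and the function name, the expert names, the type labels, and the `upgrade_fraction` parameter are all assumptions made for illustration.

```python
def build_recipe(activation_counts: dict, base: str = "q4",
                 upgraded: str = "q6", upgrade_fraction: float = 0.25) -> dict:
    """Give the most frequently activated experts a higher-precision type."""
    # Rank experts from most to least active on the observed workload.
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    hot = set(ranked[: int(len(ranked) * upgrade_fraction)])
    return {name: (upgraded if name in hot else base) for name in ranked}


# Hypothetical counts, e.g. gathered while running a coding workload.
counts = {"blk.0.exp.0": 900, "blk.0.exp.1": 12, "blk.0.exp.2": 340, "blk.0.exp.3": 3}
recipe = build_recipe(counts, upgrade_fraction=0.5)
```

With these counts, the two busiest experts get the upgraded type and the other two stay at the base type. A recipe like this changes only a few tensors at a time, which is where a fast incremental rebuild pays off.
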
Hasty Briefs (beta)