A 10-year-old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)
13 hours ago
- #Speculative Decoding
- #CPU AI Deployment
- #LLM Inference Optimization
- Explains running Gemma 4 26B-A4B on an old Xeon server without GPU using ik_llama.cpp with custom optimizations.
- Highlights memory bandwidth as the main bottleneck for LLM inference, requiring optimizations like speculative decoding and memory repacking.
- Uses speculative decoding with an MTP (multi-token prediction) drafter to cut the number of memory-bound decode passes over the main model, improving token generation speed.
- Optimizes MoE routing for CPU with flags like --cpu-moe and --merge-up-gate-experts to prevent cache thrashing.
- Applies memory management techniques: --mlock prevents swapping, --run-time-repack aligns weights for CPU cache.
- Discusses graph split modes and attention optimizations like Flash Attention and MLA to handle large context efficiently.
- Demonstrates running the 26B-parameter model from DDR3 RAM, generating text at roughly reading speed through deep engine tuning.
- Concludes that understanding inference engines and hardware mapping enables state-of-the-art AI on old hardware.
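
The speculative-decoding point in the summary above can be illustrated with a minimal sketch. It is not the ik_llama.cpp implementation: `toy_target`, `toy_drafter`, and the function names are hypothetical stand-ins, and the "single verification pass" is simulated by a loop that a real engine would run as one batched forward pass. The sketch only shows why greedy draft-and-verify yields the same tokens as plain decoding while needing fewer passes over the large model.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_decode(
    target: NextToken, drafter: NextToken, prompt: List[Token], n_new: int, k: int
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (tokens, number of target passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(out) < goal:
        # 1. The cheap drafter proposes k tokens autoregressively.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(drafter(out + draft))

        # 2. The target checks all k positions. A real engine does this in
        #    one batched pass, so it is counted as a single pass here.
        passes += 1
        accepted: List[Token] = []
        for i in range(k):
            expected = target(out + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # target's correction ends the run
                break
        else:
            # Every draft token matched: the same pass yields a bonus token.
            accepted.append(target(out + draft))
        out.extend(accepted)
    return out[:goal], passes


# Hypothetical toy models: the drafter agrees with the target most of the time.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 17


def toy_drafter(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    return (guess + 1) % 17 if len(ctx) % 5 == 0 else guess


if __name__ == "__main__":
    tokens, passes = speculative_decode(toy_target, toy_drafter, [1], 20, 4)
    print("same output as greedy:", tokens == greedy_decode(toy_target, [1], 20))
    print("target passes used:", passes, "for 20 new tokens")
```

Each pass over the large model has to stream its active weights from RAM, so on a bandwidth-limited CPU the saving comes from the pass count, provided the drafter is cheap and its guesses are accepted often enough.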