A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)

24 days ago

Explains running Gemma 4 26B-A4B on an old Xeon server without GPU using ik_llama.cpp with custom optimizations.
Highlights memory bandwidth as the main bottleneck for LLM inference, requiring optimizations like speculative decoding and memory repacking.
Uses speculative decoding with MTP drafter to reduce memory-bound decoder passes, improving token generation speed.
Optimizes MoE routing for CPU with flags like --cpu-moe and --merge-up-gate-experts to prevent cache thrashing.
Applies memory-management flags: --mlock locks the model in RAM so it is never swapped out, and --run-time-repack repacks weights at load time into a layout suited to the CPU cache.
Discusses graph split modes and attention optimizations like Flash Attention and MLA to handle large context efficiently.
Demonstrates running the 26B-parameter model from DDR3 RAM, generating tokens at roughly human reading speed through deep engine tuning.
Concludes that understanding inference engines and hardware mapping enables state-of-the-art AI on old hardware.
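
The bandwidth argument and the speculative-decoding gain summarized above can be sketched as back-of-the-envelope arithmetic. This is a rough model, not a measurement from the article: it assumes every generated token streams all active weights from RAM once, that drafted tokens are accepted independently with a fixed probability, and that the function names, the 40 GB/s bandwidth, the 4-bit weights, the 0.7 acceptance rate and the 0.05 drafter cost are all hypothetical placeholders.

```python
def decode_tokens_per_second(bandwidth_gb_s, active_params_billion, bits_per_weight):
    """Upper bound on decode speed when each token reads all active weights once."""
    bytes_per_token = active_params_billion * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


def expected_tokens_per_pass(accept_rate, draft_len):
    """Expected tokens emitted per target-model pass.

    Assumes each drafted token is accepted independently with probability
    accept_rate; the target model always contributes one token itself.
    """
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def speculative_speedup(accept_rate, draft_len, draft_cost_ratio):
    """Speedup over plain decoding.

    draft_cost_ratio is the cost of one drafter step relative to one
    target-model pass (an assumed value, not taken from the article).
    """
    tokens = expected_tokens_per_pass(accept_rate, draft_len)
    return tokens / (1 + draft_len * draft_cost_ratio)


if __name__ == "__main__":
    # Hypothetical inputs: 40 GB/s of memory bandwidth, about 4B active
    # parameters per token (the "A4B" in the model name), 4-bit weights.
    base = decode_tokens_per_second(40, 4, 4)
    gain = speculative_speedup(accept_rate=0.7, draft_len=3, draft_cost_ratio=0.05)
    print(f"bandwidth-bound baseline: {base:.1f} tok/s")
    print(f"estimated speculative speedup: {gain:.2f}x")
```

The model shows why the active parameter count matters more than the total on a bandwidth-starved machine, and why a cheap drafter with a high acceptance rate pays off: accepted draft tokens share one pass over the weights.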
