A 10 year old Xeon is all you need (for 26B-A4B MTP Drafters without GPU)

13 hours ago
  • #Speculative Decoding
  • #CPU AI Deployment
  • #LLM Inference Optimization
  • Explains how to run Gemma 4 26B-A4B on an old Xeon server without a GPU, using ik_llama.cpp with custom optimizations.
  • Highlights memory bandwidth as the main bottleneck for LLM inference, requiring optimizations like speculative decoding and memory repacking.
  • Uses speculative decoding with an MTP drafter so that each memory-bound pass of the main model can confirm several tokens at once, improving token generation speed.
  • Optimizes MoE routing for CPU with flags like --cpu-moe and --merge-up-gate-experts to prevent cache thrashing.
  • Applies memory management techniques: --mlock prevents swapping, --run-time-repack aligns weights for CPU cache.
  • Discusses graph split modes and attention optimizations like Flash Attention and MLA to handle large context efficiently.
  • Demonstrates running the 26B-parameter model from DDR3 RAM, generating tokens at roughly human reading speed through deep engine tuning.
  • Concludes that understanding inference engines and hardware mapping enables state-of-the-art AI on old hardware.
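
The bandwidth and speculative-decoding points above can be made concrete with a back-of-the-envelope sketch. It is not taken from the article: the bandwidth figure (the theoretical peak of quad-channel DDR3-1600), the bits per weight, the acceptance rate, and the draft length are all illustrative assumptions, and both function names are hypothetical. The first function gives a ceiling on decode speed if every token has to stream all active weights from RAM; the second gives the expected number of tokens confirmed per main-model pass, assuming each drafted token is accepted independently with the same probability.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, active_params_b, bits_per_weight):
    """Upper bound on tokens/s when each token streams all active weights from RAM.

    bandwidth_gb_s  : sustained memory bandwidth in GB/s (assumed, not measured)
    active_params_b : active parameters per token, in billions (4 for an "A4B" MoE)
    bits_per_weight : average quantized weight size in bits (assumed)
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8.0
    return bandwidth_gb_s / gb_read_per_token


def expected_tokens_per_pass(acceptance_rate, draft_len):
    """Expected tokens emitted per verification pass of the main model.

    Uses the geometric-series result (1 - a^(k+1)) / (1 - a), which assumes
    each of the k drafted tokens is accepted independently with probability a.
    """
    a = acceptance_rate
    if a >= 1.0:
        # Every draft accepted: k drafted tokens plus the one the pass itself yields.
        return draft_len + 1.0
    return (1.0 - a ** (draft_len + 1)) / (1.0 - a)


if __name__ == "__main__":
    # Illustrative inputs only; none of these numbers come from the article.
    ceiling = decode_ceiling_tokens_per_s(
        bandwidth_gb_s=51.2,   # theoretical peak of 4 channels of DDR3-1600
        active_params_b=4.0,   # "A4B": about 4B active parameters per token
        bits_per_weight=4.5,   # assumed average for a 4-bit-class quantization
    )
    gain = expected_tokens_per_pass(acceptance_rate=0.7, draft_len=3)
    print(f"bandwidth ceiling without drafting: {ceiling:.1f} tokens/s")
    print(f"expected tokens per verification pass: {gain:.2f}")
```

The estimate is optimistic in two ways. Sustained bandwidth on real hardware sits below the theoretical peak, and the drafter is not free. For a mixture-of-experts model, verifying several tokens in one pass may also route to more experts than a single token would, so the bytes read per pass can grow with the draft length.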