Hasty Briefs

I restarted a 10-year-old Xeon 174 times to delete 12 flags and gain 4 TPS

2 days ago
  • #benchmarking
  • #machine-learning
  • #optimization
  • Speculative decoding (drafter) performance varies by workload: beneficial for code generation, neutral for chat, and detrimental for summarization due to low acceptance rates.
  • Fixed draft lengths outperform autotuning for short generations; e.g., draft length 2 improves chat speed by 31% compared to autotune.
  • Flash attention is the most impactful flag, doubling speed across workloads; disabling it reduces performance by ~46-52%.
  • Optimal thread count matches the number of physical cores (8 here); adding hyperthreads reduces speed by 12%, while using only 4 threads cuts performance by ~35-45%.
  • Run-time repacking provides meaningful prefill speedups (~19%) but can affect drafter acceptance rates unpredictably in long documents.
  • Several flags require specific hardware or setup (e.g., mlock needs persistent memory limits, -sm graph needs multiple devices) and may be inert or harmful if conditions aren't met.
  • Flags like --mla-use are silently ignored on Gemma 4; they offer no benefit and can be safely removed.
  • Potential pitfalls include mlock failing silently after reboot, drafter-slot deadlocks in multi-slot setups, and default parameter bugs (e.g., explicit -c flag needed even at default context).
  • Ablation methodology (turning off one flag at a time and checking engine logs) is crucial for validating that each flag is actually engaged and for measuring its marginal contribution.
  • Memory bandwidth imposes a hard speed limit (~10-11 tokens/sec at Q8_0); beyond this, only quantizing to fewer bytes per weight can increase speed.
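
The one-flag-at-a-time ablation described in the list above can be sketched roughly as follows. This is a minimal illustration, not the author's harness: `ablation_configs`, `marginal_contributions`, and the `measure` callable are hypothetical names, the flag strings are placeholders rather than real engine options, and the throughput numbers in `fake_measure` are invented purely to exercise the code.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def ablation_configs(baseline: List[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (removed_flag, remaining_flags) with exactly one flag dropped."""
    for i, flag in enumerate(baseline):
        yield flag, baseline[:i] + baseline[i + 1:]


def marginal_contributions(
    baseline: List[str],
    measure: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Return, per flag, the percent change in throughput when it is removed.

    `measure` is a hypothetical callable that would launch the engine with
    the given flags and return tokens per second.
    """
    base_tps = measure(baseline)
    deltas: Dict[str, float] = {}
    for flag, remaining in ablation_configs(baseline):
        deltas[flag] = 100.0 * (measure(remaining) - base_tps) / base_tps
    return deltas


def fake_measure(flags: List[str]) -> float:
    """Stand-in for a real benchmark run; all numbers here are made up."""
    tps = 5.0
    if "--flag-a" in flags:  # placeholder flag that doubles throughput
        tps *= 2.0
    return tps               # "--flag-b" is deliberately inert


deltas = marginal_contributions(["--flag-a", "--flag-b"], fake_measure)
for flag, pct in deltas.items():
    print(f"{flag}: {pct:+.1f}% when removed")
```

A delta near 0% only suggests that a flag is inert. As the summary notes, the engine log still has to confirm whether the flag was engaged at all, and a real `measure` would need repeated runs to separate small effects from noise.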
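
The bandwidth ceiling in the last bullet can be approximated with a common rule of thumb: during decode, generating one token streams roughly all of the model weights from memory once, so tokens per second is bounded by memory bandwidth divided by the weight size in bytes. The sketch below assumes a dense model and ignores KV-cache traffic and speculative decoding. The bandwidth and parameter count are hypothetical placeholders, not figures from the article. The bits-per-weight values assume the ggml block layouts (32 weights plus one 16-bit scale per block, giving 8.5 bits for Q8_0 and 4.5 bits for Q4_0).

```python
def decode_ceiling_tps(
    bandwidth_gb_s: float,
    n_params_billion: float,
    bits_per_weight: float,
) -> float:
    """Upper bound on decode tokens/sec if every weight is read once per token."""
    weight_gb = n_params_billion * bits_per_weight / 8.0  # bits -> bytes
    return bandwidth_gb_s / weight_gb


# Hypothetical machine and model, chosen only for illustration.
BANDWIDTH_GB_S = 40.0   # assumed sustained memory bandwidth
PARAMS_BILLION = 4.0    # assumed dense parameter count

q8_ceiling = decode_ceiling_tps(BANDWIDTH_GB_S, PARAMS_BILLION, 8.5)
q4_ceiling = decode_ceiling_tps(BANDWIDTH_GB_S, PARAMS_BILLION, 4.5)

print(f"Q8_0 ceiling: {q8_ceiling:.2f} tok/s")
print(f"Q4_0 ceiling: {q4_ceiling:.2f} tok/s")
print(f"Ratio: {q4_ceiling / q8_ceiling:.2f}x")
```

Under these assumptions the ceiling scales inversely with bytes per weight, which is why no flag can push past it while a smaller quantization can.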