I restarted a 10-year-old Xeon 174 times to delete 12 flags and gain 4 TPS

2 days ago

Speculative decoding (drafter) performance varies by workload: beneficial for code generation, neutral for chat, and detrimental for summarization due to low acceptance rates.
Fixed draft lengths outperform autotuning for short generations; e.g., draft length 2 improves chat speed by 31% compared to autotune.
Flash attention is the most impactful flag, doubling speed across workloads; disabling it reduces performance by ~46-52%.
Optimal thread count matches the number of physical cores (8 here); adding hyperthreads reduces speed by 12%, while dropping to 4 threads cuts performance by ~35-45%.
Run-time repacking provides meaningful prefill speedups (~19%) but can affect drafter acceptance rates unpredictably in long documents.
Several flags require specific hardware or setup (e.g., mlock needs persistent memory limits, -sm graph needs multiple devices) and may be inert or harmful if conditions aren't met.
Flags like --mla-use are silently ignored on Gemma 4; they offer no benefit and can be safely removed.
Potential pitfalls include mlock failing silently after reboot, drafter-slot deadlocks in multi-slot setups, and default parameter bugs (e.g., explicit -c flag needed even at default context).
Ablation methodology (turning off one flag at a time and checking engine logs) is crucial to confirm that each flag actually engages and to measure its marginal contribution.
Memory bandwidth imposes a hard speed limit (~10-11 tokens/sec at Q8_0); beyond this, only quantizing to fewer bytes per weight can increase speed.
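
To make the bandwidth point concrete, here is a rough sketch of how a bandwidth-bound ceiling would scale with bytes per weight. It assumes decode speed is inversely proportional to the bytes read per token and that every weight is stored in a single ggml block format, which real model files rarely do exactly. The helper name `scaled_ceiling` is hypothetical, and Q4_0 is used only as an illustrative target; the summary does not say which smaller quant the author tried, if any.

```python
# Bytes per weight for two ggml block formats (32 weights per block).
# Q8_0: 32 int8 values + one fp16 scale = 34 bytes per block.
# Q4_0: 16 bytes of packed 4-bit values + one fp16 scale = 18 bytes per block.
BYTES_PER_WEIGHT = {"Q8_0": 34 / 32, "Q4_0": 18 / 32}


def scaled_ceiling(tps_measured, quant_from, quant_to):
    """Scale a bandwidth-bound tokens/sec ceiling from one quant to another.

    Assumption: decoding is purely memory-bandwidth bound, so speed is
    inversely proportional to the bytes read per weight.
    """
    return tps_measured * BYTES_PER_WEIGHT[quant_from] / BYTES_PER_WEIGHT[quant_to]


# The ~10-11 tokens/sec range at Q8_0 comes from the summary above;
# the Q4_0 figures are an idealized upper bound, not a measurement.
for tps in (10.0, 11.0):
    ceiling = scaled_ceiling(tps, "Q8_0", "Q4_0")
    print(f"{tps:.1f} tok/s at Q8_0 -> roughly {ceiling:.1f} tok/s ceiling at Q4_0")
```

Under these assumptions the ceiling rises by a factor of 34/18, a little under 2x. No flag can move that number, because flags do not change how many bytes must be read per generated token.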
