Running GPT-OSS-120B at 500 tokens per second on NVIDIA GPUs
- #Model Performance
- #AI Optimization
- #NVIDIA GPUs
- Day-zero model performance optimization involves experimentation, bug fixing, and benchmarking.
- Achieved state-of-the-art latency and throughput for GPT-OSS-120B on NVIDIA GPUs using the Baseten Inference Stack.
- Optimized performance by benchmarking across inference frameworks (TensorRT-LLM, vLLM, SGLang) and ensuring GPU compatibility; a minimal benchmarking sketch follows this list.
- Key optimizations included KV cache-aware routing; see the routing sketch after this list.
- Steps included running baseline inference, fixing compatibility bugs, and optimizing model configuration.
- Selected Tensor Parallelism (better latency) over Expert Parallelism (higher throughput); see the configuration sketch after this list.
- Future improvements include adding speculative decoding with Eagle 3.
- Baseten is hiring model performance engineers and offers support for optimizing AI models.
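To make the benchmarking step concrete, here is a minimal day-zero throughput check. It assumes an OpenAI-compatible server is already running (TensorRT-LLM, vLLM, and SGLang can all expose one); the base URL, prompt, and output length are illustrative placeholders, not Baseten's production setup.

```python
# Minimal day-zero latency/throughput check against an OpenAI-compatible
# endpoint. BASE_URL is a hypothetical local deployment, not Baseten's.
import json
import time

import requests

BASE_URL = "http://localhost:8000/v1"  # assumed local server

payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
    "max_tokens": 512,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
chunks = 0  # streamed content chunks, a rough proxy for generated tokens

with requests.post(f"{BASE_URL}/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style streaming sends server-sent events: b"data: {...}"
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if delta.get("content"):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1

elapsed = time.perf_counter() - start
if first_token_at is None:
    raise SystemExit("no tokens received")
print(f"TTFT: {first_token_at - start:.3f}s")
print(f"~{chunks / elapsed:.1f} chunks/sec overall (approximates tokens/sec)")
```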
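KV cache-aware routing can be sketched as "send each request to the replica whose cache already holds the longest matching prefix of the prompt." The toy router below illustrates the idea only; the Replica class, string-level prefix matching, and load-based tie-breaking are invented for this example and are not Baseten's implementation.

```python
# Toy KV cache-aware router: prefer the replica with the best prefix overlap,
# breaking ties toward the least-loaded replica. Illustration only.
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    load: int = 0  # in-flight requests
    cached_prefixes: list[str] = field(default_factory=list)


def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt: str, replicas: list[Replica]) -> Replica:
    # Score = (best cache overlap, negated load) so ties go to idler replicas.
    def score(r: Replica) -> tuple[int, int]:
        best = max((shared_prefix_len(prompt, p) for p in r.cached_prefixes), default=0)
        return (best, -r.load)

    chosen = max(replicas, key=score)
    chosen.load += 1
    chosen.cached_prefixes.append(prompt)  # its KV cache now covers this prefix
    return chosen


replicas = [Replica("gpu-0"), Replica("gpu-1")]
r1 = route("System: be helpful.\nUser: hi", replicas)
r2 = route("System: be helpful.\nUser: summarize this doc", replicas)
print(r1.name, r2.name)  # the shared system prompt steers both to the same replica
```

Production routers typically match at the token level, often over a radix tree of cached prefixes rather than raw strings, but the routing decision has the same shape.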
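Finally, here is roughly where the parallelism choice surfaces at launch time, using vLLM's offline API as one concrete example (each of the three frameworks spells this differently). The GPU count of 8 is an assumption, and the commented-out expert-parallel option should be verified against your vLLM version.

```python
# Sketch of selecting Tensor Parallelism at model load time with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    # Tensor Parallelism: every GPU holds a slice of every weight matrix, so
    # each token's forward pass uses all GPUs at once -> lower latency.
    tensor_parallel_size=8,  # assumed 8-GPU node
    # Expert Parallelism would instead shard the MoE experts across GPUs,
    # trading per-request latency for higher aggregate throughput:
    # enable_expert_parallel=True,  # assumed option; check your vLLM version
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

For a latency-focused deployment like the one described here, spending every GPU on each token is the right trade; Expert Parallelism pays off instead when maximizing aggregate tokens per GPU across many concurrent requests.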