Running GPT-OSS-120B at 500 tokens per second on NVIDIA GPUs
- #Model Performance
- #AI Optimization
- #NVIDIA GPUs
- Day-zero model performance optimization involves experimentation, bug fixing, and benchmarking.
- Achieved state-of-the-art latency and throughput for GPT-OSS-120B on NVIDIA GPUs using the Baseten Inference Stack.
- Optimized performance by benchmarking across inference frameworks (TensorRT-LLM, vLLM, SGLang) and ensuring GPU compatibility; a minimal benchmarking sketch follows this list.
- Key optimizations included KV cache-aware routing; see the routing sketch after this list.
- Steps included running baseline inference, fixing compatibility bugs, and optimizing model configuration.
- Selected Tensor Parallelism (better latency) over Expert Parallelism (higher throughput); see the configuration sketch after this list.
- Future improvements include adding speculative decoding with Eagle 3.
- Baseten is hiring model performance engineers and offers support for optimizing AI models.
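To make the benchmarking step concrete, here is a minimal day-zero throughput check. It assumes an OpenAI-compatible server is already running (TensorRT-LLM, vLLM, and SGLang can all expose one); the base URL, prompt, and output length are illustrative placeholders, not Baseten's production setup.

```python
# Minimal day-zero latency/throughput check against an OpenAI-compatible
# endpoint. BASE_URL is a hypothetical local deployment, not Baseten's.
import json
import time

import requests

BASE_URL = "http://localhost:8000/v1"  # assumed local server

payload = {
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "Explain KV caching in one paragraph."}],
    "max_tokens": 512,
    "stream": True,
}

start = time.perf_counter()
first_token_at = None
chunks = 0  # streamed content chunks, a rough proxy for generated tokens

with requests.post(f"{BASE_URL}/chat/completions", json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # OpenAI-style streaming sends server-sent events: b"data: {...}"
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        delta = json.loads(data)["choices"][0]["delta"]
        if delta.get("content"):
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1

elapsed = time.perf_counter() - start
if first_token_at is None:
    raise SystemExit("no tokens received")
print(f"TTFT: {first_token_at - start:.3f}s")
print(f"~{chunks / elapsed:.1f} chunks/sec overall (approximates tokens/sec)")
```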
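KV cache-aware routing can be sketched as "send each request to the replica whose cache already holds the longest matching prefix of the prompt." The toy router below illustrates the idea only; the Replica class, string-level prefix matching, and load-based tie-breaking are invented for this example and are not Baseten's implementation.

```python
# Toy KV cache-aware router: prefer the replica with the best prefix overlap,
# breaking ties toward the least-loaded replica. Illustration only.
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    load: int = 0  # in-flight requests
    cached_prefixes: list[str] = field(default_factory=list)


def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def route(prompt: str, replicas: list[Replica]) -> Replica:
    # Score = (best cache overlap, negated load) so ties go to idler replicas.
    def score(r: Replica) -> tuple[int, int]:
        best = max((shared_prefix_len(prompt, p) for p in r.cached_prefixes), default=0)
        return (best, -r.load)

    chosen = max(replicas, key=score)
    chosen.load += 1
    chosen.cached_prefixes.append(prompt)  # its KV cache now covers this prefix
    return chosen


replicas = [Replica("gpu-0"), Replica("gpu-1")]
r1 = route("System: be helpful.\nUser: hi", replicas)
r2 = route("System: be helpful.\nUser: summarize this doc", replicas)
print(r1.name, r2.name)  # the shared system prompt steers both to the same replica
```

Production routers typically match at the token level, often over a radix tree of cached prefixes rather than raw strings, but the routing decision has the same shape.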
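Finally, here is roughly where the parallelism choice surfaces at launch time, using vLLM's offline API as one concrete example (each of the three frameworks spells this differently). The GPU count of 8 is an assumption, and the commented-out expert-parallel option should be verified against your vLLM version.

```python
# Sketch of selecting Tensor Parallelism at model load time with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",
    # Tensor Parallelism: every GPU holds a slice of every weight matrix, so
    # each token's forward pass uses all GPUs at once -> lower latency.
    tensor_parallel_size=8,  # assumed 8-GPU node
    # Expert Parallelism would instead shard the MoE experts across GPUs,
    # trading per-request latency for higher aggregate throughput:
    # enable_expert_parallel=True,  # assumed option; check your vLLM version
)

out = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```

For a latency-focused deployment like the one described here, spending every GPU on each token is the right trade; Expert Parallelism pays off instead when maximizing aggregate tokens per GPU across many concurrent requests.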