
Running GPT-OSS-120B at 500 tokens per second on Nvidia GPUs

17 days ago
  • #Model Performance
  • #AI Optimization
  • #NVIDIA GPUs
  • Day-zero model performance optimization involves experimentation, bug fixing, and benchmarking.
  • Achieved SOTA latency and throughput for GPT-OSS-120B on NVIDIA GPUs using the Baseten Inference Stack.
  • Optimized performance by testing across frameworks (TensorRT-LLM, vLLM, SGLang) and ensuring GPU compatibility; see the benchmarking sketch after this list.
  • Key optimizations included KV cache-aware routing and speculative decoding with Eagle; a routing sketch follows the list.
  • Steps included running baseline inference, fixing compatibility bugs, and optimizing model configuration.
  • Selected Tensor Parallelism for lower latency over Expert Parallelism, which favors throughput; a configuration sketch follows the list.
  • Future improvements include adding speculative decoding with Eagle 3.
  • Baseten is hiring model performance engineers and offers support for optimizing AI models.
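
A minimal way to compare frameworks on the axes the post cares about (time to first token and decode throughput) is to probe an OpenAI-compatible endpoint, which TensorRT-LLM, vLLM, and SGLang frontends all expose. This is a sketch, not Baseten's harness: `BASE_URL` and the model id are assumptions, and counting one streamed chunk as one token is an approximation.

```python
# Rough TTFT / decode-throughput probe against an OpenAI-compatible server.
# BASE_URL and MODEL are placeholders; adjust for your deployment.
import time

from openai import OpenAI

BASE_URL = "http://localhost:8000/v1"  # assumed local inference server
MODEL = "openai/gpt-oss-120b"          # assumed model id

client = OpenAI(base_url=BASE_URL, api_key="dummy")

start = time.perf_counter()
first_token_at = None
chunks = 0

stream = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain KV caching in one paragraph."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1  # approximation: one streamed chunk ~ one token

end = time.perf_counter()
print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
print(f"Decode throughput: {chunks / (end - first_token_at):.1f} tok/s")
```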
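KV cache-aware routing sends each request to the replica most likely to already hold that request's prefix in its KV cache, so prefill work is reused across turns. The sketch below is a toy illustration of the idea, not Baseten's router: the replica pool, history size, and prefix-match scoring are all assumptions.

```python
# Toy KV cache-aware router: pick the replica whose recently served prompts
# share the longest character prefix with the incoming request, so its cached
# KV entries are most reusable. Illustrative only.
from collections import deque

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical pool
recent_prompts = {r: deque(maxlen=64) for r in REPLICAS}

def shared_prefix_len(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: str) -> str:
    # Score each replica by its best prefix overlap with the new prompt.
    def score(replica: str) -> int:
        return max((shared_prefix_len(prompt, p) for p in recent_prompts[replica]), default=0)
    best = max(REPLICAS, key=score)
    recent_prompts[best].append(prompt)
    return best

# A follow-up turn sharing the chat prefix lands on the same replica.
print(route("system: be helpful\nuser: hi"))
print(route("system: be helpful\nuser: hi\nassistant: hello\nuser: and then?"))
```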
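The trade-off behind choosing Tensor Parallelism can be made concrete with vLLM, one of the frameworks the post tested: Tensor Parallelism shards every layer across GPUs so each token's forward pass finishes sooner, while Expert Parallelism spreads MoE experts across GPUs, which helps aggregate throughput more than per-request latency. The model id and `tensor_parallel_size=8` below are assumptions (an 8-GPU node), not the post's exact configuration.

```python
# Sketch: serving the model with tensor parallelism via vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="openai/gpt-oss-120b",  # assumed Hugging Face model id
    tensor_parallel_size=8,       # shard each layer across 8 GPUs for latency
)

outputs = llm.generate(["Say hello."], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```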