Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
8 hours ago
- #AI Inference
- #Memory Bandwidth
- #GPU Optimization
- Kog AI launches tech preview of Kog Inference Engine achieving 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 tokens/s on 8× NVIDIA H200.
- Optimizing single-request decoding speed is crucial for AI agents; at batch size 1, decoding is bottlenecked by memory bandwidth, not compute power.
- Standard GPU hardware has untapped decoding potential due to software inefficiencies; Kog co-designs model architecture, runtime, and kernels.
- Kog's monokernel runtime eliminates kernel launch overheads, uses custom KCCL for low-latency GPU communication, and employs Laneformer architecture with Delayed Tensor Parallelism.
- Memory bandwidth utilization (MBU) is the key metric; current GPUs offer high bandwidth, but software bottlenecks waste microseconds on every token.
- Kog's approach removes microsecond losses via persistent GPU programs, topology-aware memory placement, and fused operations.
- Tech preview runs a 2B Laneformer coding model at batch size 1, with future plans to support large third-party MoE models at similar speeds.
- Kog's scaling projections suggest large MoE models could reach 1,000–5,000 tokens/s on standard GPUs with FP8/FP4 quantization and Kog's stack.
- Dedicated inference hardware excels in single-request speed, but GPUs can compete with optimized software like Kog's engine.
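
The memory-bandwidth argument in the points above can be sketched with a little arithmetic. This is a rough back-of-the-envelope model, not Kog's methodology. It assumes a dense model whose weights are all read once per generated token, and it ignores KV-cache traffic and inter-GPU communication. The function names are hypothetical. The weight precision (2 bytes per parameter), the launch count, and the per-launch cost are illustrative assumptions not stated in the source. The 5.3 TB/s figure is AMD's published peak HBM bandwidth for one MI300X.

```python
def decode_ceiling_tokens_per_s(params, bytes_per_param, bandwidth_bytes_per_s, num_gpus=1):
    """Bandwidth-bound ceiling on batch-size-1 decode speed.

    Assumes every weight is streamed from memory exactly once per token
    and that the weights are split evenly across the GPUs.
    """
    bytes_per_token = params * bytes_per_param
    return num_gpus * bandwidth_bytes_per_s / bytes_per_token


def mbu(achieved_tokens_per_s, ceiling_tokens_per_s):
    """Memory bandwidth utilization: achieved speed as a fraction of the ceiling."""
    return achieved_tokens_per_s / ceiling_tokens_per_s


def token_budget_us(tokens_per_s):
    """Wall-clock time available per token, in microseconds."""
    return 1e6 / tokens_per_s


def launch_overhead_fraction(launches_per_token, launch_us, tokens_per_s):
    """Share of the per-token time budget spent on kernel launches alone."""
    return launches_per_token * launch_us / token_budget_us(tokens_per_s)


if __name__ == "__main__":
    # ASSUMPTION: 2 bytes per parameter (FP16/BF16); the source does not state the precision.
    ceiling = decode_ceiling_tokens_per_s(
        params=2e9, bytes_per_param=2, bandwidth_bytes_per_s=5.3e12, num_gpus=8
    )
    print(f"bandwidth-bound ceiling: {ceiling:,.0f} tokens/s")
    print(f"time budget at 3,000 tokens/s: {token_budget_us(3000):.0f} us per token")
    # ASSUMPTION: 20 launches per token at 5 us each, purely illustrative numbers.
    print(f"launch overhead share: {launch_overhead_fraction(20, 5.0, 3000):.0%}")
```

At 3,000 tokens/s each token has a budget of roughly 333 microseconds, so a few microseconds lost per kernel launch or synchronization step add up to a meaningful fraction of the total. That is the motivation the summary gives for persistent GPU programs and fused operations.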