Real-time LLM Inference on Standard GPUs: 3k tokens/s per request

8 hours ago

Kog AI launches tech preview of Kog Inference Engine achieving 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 tokens/s on 8× NVIDIA H200.
Single-request decoding speed is crucial for AI agents, and at batch size 1 decoding is limited by memory bandwidth rather than compute power.
Standard GPU hardware has untapped decoding potential due to software inefficiencies; Kog co-designs model architecture, runtime, and kernels.
Kog's monokernel runtime eliminates kernel launch overheads, uses custom KCCL for low-latency GPU communication, and employs Laneformer architecture with Delayed Tensor Parallelism.
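Memory bandwidth utilization (MBU) is the key metric; current GPUs offer high memory bandwidth, but software overheads waste microseconds on every token, which adds up when the whole per-token budget at 3,000 tokens/s is about 333 µs.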
Kog's approach removes microsecond losses via persistent GPU programs, topology-aware memory placement, and fused operations.
Tech preview runs a 2B Laneformer coding model at batch size 1, with future plans to support large third-party MoE models at similar speeds.
Scaling projections show large MoE models could reach 1,000–5,000 tokens/s on standard GPUs with FP8/FP4 quantization and Kog's stack.
Dedicated inference hardware excels in single-request speed, but GPUs can compete with optimized software like Kog's engine.
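
To make the memory-bandwidth argument concrete, here is a rough back-of-the-envelope sketch of the bandwidth ceiling and the resulting MBU. It is not Kog's methodology. It assumes that every weight is read from HBM exactly once per generated token, ignores KV-cache and activation traffic, and assumes 2 bytes per parameter (the brief does not state the preview model's precision). The per-GPU bandwidth values are vendor peak figures (about 5.3 TB/s for MI300X and 4.8 TB/s for H200), and all function names are hypothetical.

```python
def decode_ceiling_tokens_per_s(param_count, bytes_per_param, num_gpus, bw_per_gpu):
    """Upper bound on batch-1 decode speed, assuming each weight is read
    once per token and the weights are sharded evenly across all GPUs."""
    bytes_per_token = param_count * bytes_per_param
    aggregate_bandwidth = num_gpus * bw_per_gpu  # bytes per second
    return aggregate_bandwidth / bytes_per_token


def memory_bandwidth_utilization(tokens_per_s, ceiling_tokens_per_s):
    """MBU as the fraction of the bandwidth ceiling actually achieved."""
    return tokens_per_s / ceiling_tokens_per_s


def per_token_budget_us(tokens_per_s):
    """Wall-clock time available for one token, in microseconds."""
    return 1e6 / tokens_per_s


# Assumed inputs: 2B parameters, 2 bytes per parameter (assumption),
# 8 GPUs, vendor peak HBM bandwidth in bytes per second.
PARAMS = 2e9
BYTES_PER_PARAM = 2
configs = {
    "8x MI300X": (5.3e12, 3000),  # (peak bandwidth, reported tokens/s)
    "8x H200": (4.8e12, 2100),
}

for name, (bandwidth, reported) in configs.items():
    ceiling = decode_ceiling_tokens_per_s(PARAMS, BYTES_PER_PARAM, 8, bandwidth)
    mbu = memory_bandwidth_utilization(reported, ceiling)
    budget = per_token_budget_us(reported)
    print(f"{name}: ceiling {ceiling:,.0f} tok/s, "
          f"MBU {mbu:.1%}, budget {budget:.0f} us/token")
```

Under these assumed inputs the ceiling is on the order of 10,000 tokens/s for either system, so a few microseconds lost per kernel launch or synchronization step is a visible fraction of the per-token budget. The numbers change directly with the assumed precision, so treat them as illustrative rather than as figures from the announcement.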
