GPT-OSS vs. Qwen3 and a detailed look at how things evolved since GPT-2
14 days ago
- #LLM
- #architecture
- #OpenAI
- OpenAI released new open-weight LLMs, gpt-oss-120b and gpt-oss-20b, their first open-weight models since GPT-2 in 2019.
- The architecture includes optimizations such as MXFP4 quantization of the MoE expert weights, which lets gpt-oss-20b run on a single ~16 GB GPU and gpt-oss-120b on a single 80 GB GPU (see the MXFP4 sketch after this list).
- Key architectural changes from GPT-2 include removing dropout, replacing learned absolute positional embeddings with RoPE, and swapping GELU for Swish/SwiGLU in the feed-forward blocks (SwiGLU sketch after this list).
- Mixture-of-Experts (MoE) layers replace the single feed-forward module, increasing total parameter count (capacity) while keeping inference efficient because only a few experts are active per token (routing sketch after this list).
- Grouped Query Attention (GQA) shrinks the KV cache by sharing key/value heads across groups of query heads, and sliding-window attention restricts each token to a local attention window; both cut compute and memory (GQA sketch after this list).
- RMSNorm replaces LayerNorm; dropping the mean-centering and bias terms makes it slightly cheaper to compute (sketch after this list).
- Compared with Qwen3, gpt-oss is wider but shallower, and it uses fewer, larger experts where Qwen3 uses many smaller ones.
- gpt-oss models support adjustable reasoning effort (low/medium/high), set via a line in the system prompt (prompt example after this list).
- Benchmarks show gpt-oss is competitive with proprietary models and Qwen3, despite being smaller.
- GPT-5 was released shortly after gpt-oss, with gpt-oss performing surprisingly well in comparison.
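
Below are small PyTorch sketches of the architecture points referenced above. They are illustrative only, not the actual gpt-oss implementation; names and dimensions are made up.

First, the MXFP4 idea: blocks of 32 FP4 (E2M1) values share a single power-of-two scale, so each weight costs a little over 4 bits on average. A rough dequantization sketch:

```python
import torch

# Illustrative only: the 16 values representable in FP4 E2M1 (sign in the high bit).
E2M1_VALUES = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]
)

def dequantize_mxfp4_block(codes: torch.Tensor, shared_exponent: int) -> torch.Tensor:
    """codes: 32 four-bit indices into the E2M1 table; one shared power-of-two scale per block."""
    return E2M1_VALUES[codes.long()] * (2.0 ** shared_exponent)
```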
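The GELU-to-SwiGLU change is easiest to see in code. A minimal sketch; the module and layer names are mine, not gpt-oss's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feed-forward block: silu(x @ W_gate) * (x @ W_up), projected back down."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish/SiLU acts as a learned gate on the second linear projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```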
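The MoE bullet boils down to routing each token to a small subset of expert feed-forward blocks. A naive sketch, continuing the same file and reusing the SwiGLUFeedForward class above (real implementations batch tokens per expert instead of looping):

```python
class SparseMoE(nn.Module):
    """A router picks top-k experts per token; only those experts run for that token."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUFeedForward(d_model, d_hidden) for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.router(x)                                # (batch, seq, num_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)  # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., slot] == e                     # tokens whose slot-th pick is expert e
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Total parameters grow with the number of experts, but per-token compute only grows with `top_k`, which is the capacity-vs-inference-cost trade-off the bullet describes.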
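Grouped Query Attention is also compact in code: queries keep all heads, while keys and values use fewer heads shared across groups of query heads, which shrinks the KV cache. A minimal causal-attention sketch in the same file, assuming PyTorch 2.x (sliding-window masking omitted for brevity):

```python
class GroupedQueryAttention(nn.Module):
    """num_kv_heads < num_heads: each K/V head is shared by a group of query heads."""
    def __init__(self, d_model: int, num_heads: int, num_kv_heads: int):
        super().__init__()
        assert num_heads % num_kv_heads == 0
        self.num_heads, self.num_kv_heads = num_heads, num_kv_heads
        self.head_dim = d_model // num_heads
        self.wq = nn.Linear(d_model, num_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, num_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(num_heads * self.head_dim, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Only num_kv_heads K/V projections are cached; expand them to match the query heads.
        repeat = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```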
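Finally, the RMSNorm swap: unlike LayerNorm, it skips mean subtraction and the bias term, normalizing only by the root-mean-square of the features. A minimal version, same imports as above:

```python
class RMSNorm(nn.Module):
    """Scale-only normalization: divide by the RMS of the features, then apply a learned gain."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```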
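And the reasoning-effort bullet in practice: the effort level is just a line in the system message that the serving stack renders into the model's chat template. The wording below is a plausible example, not the verbatim gpt-oss template:

```python
# Hypothetical chat request: the "Reasoning: high" line in the system message
# asks the model for a longer chain of thought before it answers.
messages = [
    {"role": "system", "content": "You are a helpful assistant. Reasoning: high"},
    {"role": "user", "content": "How many primes are there below 100?"},
]
```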