Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
- #neural architecture search
- #language models
- #machine learning
- Jet-Nemotron is a new family of hybrid-architecture language models that improves generation throughput while maintaining or exceeding the accuracy of full-attention models.
- Post Neural Architecture Search (PostNAS) is introduced as a pipeline for efficient model design: it starts from a pre-trained full-attention model, freezes its MLP weights, and searches only over the attention architecture.
- The PostNAS pipeline comprises four key components: full-attention layer placement and elimination, linear attention block selection, new attention block design, and hardware-aware hyperparameter search (a minimal sketch illustrating the frozen-MLP setup and attention replacement follows this list).
- The Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a broad range of benchmarks, with significant speedups in both generation and prefilling.
- The model also outperforms larger MoE full-attention models such as DeepSeek-V3-Small and Moonlight on the MMLU and MMLU-Pro benchmarks despite having fewer parameters.
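
Below is a minimal, self-contained PyTorch sketch of the core PostNAS recipe summarized above: keep the pre-trained MLP weights frozen, retain full attention only at layer positions chosen by the placement search, and replace the remaining attention layers with a linear attention block. The class and function names (`FullAttention`, `LinearAttention`, `Block`, `build_postnas_candidate`), the single-head attention, and the toy dimensions are illustrative assumptions rather than the paper's implementation; Jet-Nemotron's searched block (JetBlock) and training setup are considerably more involved.

```python
import torch
import torch.nn as nn


class FullAttention(nn.Module):
    """Standard softmax attention (single-head placeholder)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return self.out(torch.softmax(scores, dim=-1) @ v)


class LinearAttention(nn.Module):
    """Kernelized linear attention, O(seq) instead of O(seq^2) (illustrative stand-in)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = torch.relu(q) + 1e-6, torch.relu(k) + 1e-6    # positive feature map
        kv = k.transpose(-2, -1) @ v                         # (d_model, d_model) summary
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)
        return self.out((q @ kv) / z)


class Block(nn.Module):
    """Transformer block: attention + MLP with residual connections."""
    def __init__(self, d_model, attn):
        super().__init__()
        self.attn = attn
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(x)
        return x + self.mlp(x)


def build_postnas_candidate(pretrained_blocks, keep_full_attention, d_model):
    """Assemble one search candidate: reuse frozen MLP weights from the pre-trained
    model, keep full attention only at indices in `keep_full_attention`, and use
    linear attention everywhere else. Only the attention modules remain trainable."""
    blocks = []
    for i, blk in enumerate(pretrained_blocks):
        attn = blk.attn if i in keep_full_attention else LinearAttention(d_model)
        new_blk = Block(d_model, attn)
        new_blk.mlp.load_state_dict(blk.mlp.state_dict())    # copy pre-trained MLP weights
        for p in new_blk.mlp.parameters():
            p.requires_grad = False                          # freeze MLPs, per PostNAS
        blocks.append(new_blk)
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    d_model, n_layers = 64, 4
    pretrained = [Block(d_model, FullAttention(d_model)) for _ in range(n_layers)]
    # Hypothetical search decision: keep full attention only in layer 1.
    candidate = build_postnas_candidate(pretrained, keep_full_attention={1}, d_model=d_model)
    x = torch.randn(2, 16, d_model)
    print(candidate(x).shape)                                # torch.Size([2, 16, 64])
```

In an actual PostNAS run, many such candidates (different full-attention placements, different linear attention blocks, and different hardware-aware hyperparameters) would be trained briefly with the MLPs frozen and then compared on accuracy and measured throughput; the sketch only shows how a single candidate is assembled.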