Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search
- #neural architecture search
- #language models
- #machine learning
- Jet-Nemotron is a new family of hybrid-architecture language models that improves generation throughput while maintaining or exceeding the accuracy of full-attention models.
- Post Neural Architecture Search (PostNAS) is introduced as a pipeline for efficient model design: it starts from a pre-trained full-attention model, freezes its MLP weights, and searches only over the attention architecture.
- The PostNAS pipeline comprises four key components: full-attention layer placement and elimination, linear attention block selection, new attention block design, and hardware-aware hyperparameter search (a minimal sketch illustrating the frozen-MLP setup and attention replacement follows this list).
- The Jet-Nemotron-2B model achieves comparable or superior accuracy to Qwen3, Qwen2.5, Gemma3, and Llama3.2 across a broad range of benchmarks, with significant speedups in both generation and prefilling.
- The model also outperforms larger MoE full-attention models such as DeepSeek-V3-Small and Moonlight on the MMLU and MMLU-Pro benchmarks despite having fewer parameters.
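
Below is a minimal, self-contained PyTorch sketch of the core PostNAS recipe summarized above: keep the pre-trained MLP weights frozen, retain full attention only at layer positions chosen by the placement search, and replace the remaining attention layers with a linear attention block. The class and function names (`FullAttention`, `LinearAttention`, `Block`, `build_postnas_candidate`), the single-head attention, and the toy dimensions are illustrative assumptions rather than the paper's implementation; Jet-Nemotron's searched block (JetBlock) and training setup are considerably more involved.

```python
import torch
import torch.nn as nn


class FullAttention(nn.Module):
    """Standard softmax attention (single-head placeholder)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
        return self.out(torch.softmax(scores, dim=-1) @ v)


class LinearAttention(nn.Module):
    """Kernelized linear attention, O(seq) instead of O(seq^2) (illustrative stand-in)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = torch.relu(q) + 1e-6, torch.relu(k) + 1e-6    # positive feature map
        kv = k.transpose(-2, -1) @ v                         # (d_model, d_model) summary
        z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)
        return self.out((q @ kv) / z)


class Block(nn.Module):
    """Transformer block: attention + MLP with residual connections."""
    def __init__(self, d_model, attn):
        super().__init__()
        self.attn = attn
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        x = x + self.attn(x)
        return x + self.mlp(x)


def build_postnas_candidate(pretrained_blocks, keep_full_attention, d_model):
    """Assemble one search candidate: reuse frozen MLP weights from the pre-trained
    model, keep full attention only at indices in `keep_full_attention`, and use
    linear attention everywhere else. Only the attention modules remain trainable."""
    blocks = []
    for i, blk in enumerate(pretrained_blocks):
        attn = blk.attn if i in keep_full_attention else LinearAttention(d_model)
        new_blk = Block(d_model, attn)
        new_blk.mlp.load_state_dict(blk.mlp.state_dict())    # copy pre-trained MLP weights
        for p in new_blk.mlp.parameters():
            p.requires_grad = False                          # freeze MLPs, per PostNAS
        blocks.append(new_blk)
    return nn.Sequential(*blocks)


if __name__ == "__main__":
    d_model, n_layers = 64, 4
    pretrained = [Block(d_model, FullAttention(d_model)) for _ in range(n_layers)]
    # Hypothetical search decision: keep full attention only in layer 1.
    candidate = build_postnas_candidate(pretrained, keep_full_attention={1}, d_model=d_model)
    x = torch.randn(2, 16, d_model)
    print(candidate(x).shape)                                # torch.Size([2, 16, 64])
```

In an actual PostNAS run, many such candidates (different full-attention placements, different linear attention blocks, and different hardware-aware hyperparameters) would be trained briefly with the MLPs frozen and then compared on accuracy and measured throughput; the sketch only shows how a single candidate is assembled.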