Jet-Nemotron: Efficient Language Model with Post Neural Architecture Search

  • #neural architecture search
  • #language models
  • #machine learning
  • Jet-Nemotron is a new family of hybrid-architecture language models that improves generation throughput while maintaining or exceeding the accuracy of full-attention models.
  • Post Neural Architecture Search (PostNAS) is introduced as a pipeline for efficient model design: it starts from a pre-trained full-attention model and freezes its MLP weights, so that only the attention architecture is searched and trained (a minimal freezing sketch follows this list).
  • The PostNAS pipeline comprises four key components: optimal full-attention layer placement and elimination, linear attention block selection, new attention block design, and hardware-aware hyperparameter search (see the search sketch after this list).
  • The Jet-Nemotron-2B model achieves comparable or superior accuracy to models such as Qwen3, Qwen2.5, Gemma3, and Llama3.2 across benchmarks, while delivering significant speedups in both generation and prefilling.
  • The model also outperforms larger mixture-of-experts (MoE) full-attention models such as DeepSeek-V3-Small and Moonlight on the MMLU and MMLU-Pro benchmarks, despite having fewer parameters.
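
To illustrate the PostNAS starting point, here is a minimal PyTorch sketch of freezing a pre-trained model's MLP weights so that only the attention blocks remain trainable. The name-matching convention ("mlp" in parameter names) is an illustrative assumption, not the paper's actual code.

```python
import torch.nn as nn

def freeze_mlp_weights(model: nn.Module) -> None:
    # PostNAS keeps the pre-trained MLP (feed-forward) weights fixed and
    # searches/trains only the attention architecture. We assume MLP
    # parameters carry "mlp" in their names (a common but hypothetical
    # convention; adjust to the actual model's naming scheme).
    for name, param in model.named_parameters():
        if "mlp" in name:
            param.requires_grad = False

# Usage (hypothetical checkpoint choice):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B")
# freeze_mlp_weights(model)
# trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
```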
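
And here is a hedged sketch of the hardware-aware hyperparameter search: candidate attention-block configurations are benchmarked for generation throughput on the target GPU, and the most accurate candidate that meets the speed budget is kept. All names here (the `Candidate` fields, `build_model`, `eval_accuracy`, the `generate()` call) are illustrative assumptions, not the paper's interface.

```python
import time
from dataclasses import dataclass

@dataclass
class Candidate:
    # Hypothetical attention-block hyperparameters explored by the search.
    state_dim: int
    num_heads: int
    head_dim: int

def measure_throughput(model, batch, n_tokens: int = 256) -> float:
    # Tokens per second for autoregressive generation on the target GPU
    # (assumes the model exposes a generate() method, as in many LM APIs).
    start = time.perf_counter()
    model.generate(batch, max_new_tokens=n_tokens)
    return n_tokens / (time.perf_counter() - start)

def hardware_aware_search(candidates, build_model, batch, eval_accuracy,
                          min_throughput: float):
    # Keep the most accurate candidate whose measured speed meets the budget.
    best, best_acc = None, float("-inf")
    for cand in candidates:
        model = build_model(cand)                     # hybrid model variant
        if measure_throughput(model, batch) < min_throughput:
            continue                                  # too slow on this GPU
        acc = eval_accuracy(model)                    # proxy-task accuracy
        if acc > best_acc:
            best, best_acc = cand, acc
    return best
```

Benchmarking throughput directly on the deployment hardware, rather than relying on parameter count as a proxy for speed, is what makes this step hardware-aware.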