LFM2-24B-A2B: Scaling Up the LFM2 Architecture
- #MoE Architecture
- #Edge AI
- #AI Models
- Release of LFM2-24B-A2B, a sparse Mixture of Experts (MoE) model with 24B total parameters and 2B active parameters per token, making it the largest model in the LFM2 family.
- The LFM2 architecture now scales from 350M to 24B parameters with consistent quality gains on benchmarks, and the 24B model is designed to run within 32 GB of RAM for both cloud and edge deployment.
- Open weights are available on Hugging Face, with support for local execution and fine-tuning, plus a playground for testing (a minimal loading sketch follows this list).
- The scaling strategy uses a deeper stack (40 layers vs. 24), more experts (64 vs. 32), and a lean active path, keeping per-token compute low while expanding the total parameter count (see the routing sketch below).
- Benchmarks such as GPQA Diamond and MMLU-Pro show log-linear quality improvements across the LFM2 family, indicating predictable scaling as model size grows.
- Inference is supported via llama.cpp, vLLM, and SGLang with multiple quantization options, and the model outperforms comparable MoE models in throughput tests on CPUs, GPUs, and NPUs (see the quantized-inference sketch at the end of this list).
- Pre-training is ongoing beyond 17T tokens, with an enhanced LFM2.5-24B-A2B version planned after further post-training and reinforcement learning.
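
For readers unfamiliar with sparse MoE routing, here is a minimal PyTorch sketch of a top-k routed expert layer. The sizes, expert count, and top-k value are illustrative placeholders, not LFM2-24B-A2B's actual configuration; the point is only that each token activates a small subset of experts, which is how a model with 24B total parameters can use roughly 2B parameters per token.

```python
# Illustrative sketch of a sparse MoE feed-forward block with top-k routing.
# All sizes here are placeholders, not the actual LFM2-24B-A2B configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (tokens, d_model). The router scores every expert per token,
        # but only the top_k experts are actually evaluated for each token.
        scores = self.router(x)                                 # (tokens, n_experts)
        weights, idx = torch.topk(scores, self.top_k, dim=-1)   # (tokens, top_k)
        weights = F.softmax(weights, dim=-1)                    # normalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                        # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Each of the 8 tokens below touches only top_k of the 64 experts,
# so active compute per token is a small fraction of the total parameters.
y = SparseMoE()(torch.randn(8, 1024))
```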
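A minimal sketch of loading the open weights locally with Hugging Face transformers, assuming the standard `AutoModelForCausalLM` path applies and that the repo id is `LiquidAI/LFM2-24B-A2B` (check the actual model card before use):

```python
# Hedged local-inference sketch with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-24B-A2B"  # assumed repo id; verify on the model card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # only ~2B params are active per token, but all 24B must fit in memory
    device_map="auto",
)

prompt = "Explain sparse Mixture of Experts in one paragraph."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```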
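For CPU or edge deployment, a quantized GGUF build can be run through llama.cpp's Python bindings. The file name below is hypothetical; substitute whichever quantization the release actually ships.

```python
# Hedged sketch: running a quantized GGUF build via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="lfm2-24b-a2b-Q4_K_M.gguf",  # hypothetical file name; use a published quant
    n_ctx=4096,                             # context window for this session
)
out = llm("Summarize sparse MoE routing in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```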