LFM2.5-8B-A1B: An Even Better On-Device Mixture-of-Experts

a day ago

LFM2.5-8B-A1B is an edge model released for fast, reliable tool calling on consumer hardware, building on LFM2-8B-A1B with a 128K context window, 38T token pretraining, and reinforcement learning.
Key features include on-device personal assistant capabilities, compressed performance competitive with larger models, and unmatched throughput, with day-one support for llama.cpp, MLX, vLLM, and SGLang.
Improvements over the predecessor include expanded vocabulary to 128K for better non-Latin language tokenization, reasoning-only design with explicit chain of thought, and reduced hallucinations via targeted RL stages.
Training highlights involve tokenizer expansion through BPE merge training, context extension to 128K via RoPE adjustments, and mitigation of doom loops and hallucinations with preference optimization and avg@k-based rewards.
The model benchmarks competitively in knowledge, instruction following, math, and agentic workflows, with low hallucination rates and high efficiency on both CPU and GPU inference across various platforms.
Supported inference ecosystems include LEAP, llama.cpp, MLX, vLLM, SGLang, and ONNX, enabling fast deployment on devices from laptops to phones, with examples like LocalCowork demo showcasing interactive tool-dispatch loops.
LFM2.5-8B-A1B is open-weight, fast from day one, and part of a complete model family, aiming to power on-device, private AI agents without data leaving the device.

Hasty Briefsbeta