LFM2.5-8B-A1B: An Even Better On-Device Mixture-of-Experts
a day ago
- #Edge Computing
- #Tool Calling
- #AI Model Release
- LFM2.5-8B-A1B is an edge model released for fast, reliable tool calling on consumer hardware, building on LFM2-8B-A1B with a 128K context window, 38T token pretraining, and reinforcement learning.
- Key features include on-device personal assistant capabilities, compressed performance competitive with larger models, and unmatched throughput, with day-one support for llama.cpp, MLX, vLLM, and SGLang.
- Improvements over the predecessor include expanded vocabulary to 128K for better non-Latin language tokenization, reasoning-only design with explicit chain of thought, and reduced hallucinations via targeted RL stages.
- Training highlights involve tokenizer expansion through BPE merge training, context extension to 128K via RoPE adjustments, and mitigation of doom loops and hallucinations with preference optimization and avg@k-based rewards.
- The model benchmarks competitively in knowledge, instruction following, math, and agentic workflows, with low hallucination rates and high efficiency on both CPU and GPU inference across various platforms.
- Supported inference ecosystems include LEAP, llama.cpp, MLX, vLLM, SGLang, and ONNX, enabling fast deployment on devices from laptops to phones, with examples like LocalCowork demo showcasing interactive tool-dispatch loops.
- LFM2.5-8B-A1B is open-weight, fast from day one, and part of a complete model family, aiming to power on-device, private AI agents without data leaving the device.