Meta's Omnilingual MT for 1,600 Languages
- #multilingual-models
- #machine-translation
- #language-technology
- No Language Left Behind (NLLB) demonstrated that high-quality machine translation (MT) can scale to 200 languages.
- Large Language Models (LLMs) improved MT quality but did not extend language coverage significantly.
- Current MT systems struggle with limited coverage and generation bottlenecks, leaving many languages unsupported.
- Omnilingual Machine Translation (OMT) supports over 1,600 languages, making it the first MT system to reach that scale.
- OMT uses a comprehensive data strategy combining manually curated datasets, synthetic backtranslation, and large-scale mining (see the backtranslation sketch after this list).
- Evaluation includes BLASER 3, OmniTOX, BOUQuET, and Met-BOUQuET for reliable and expansive assessment.
- Two specialized LLM approaches for MT: OMT-LLaMA (decoder-only) and OMT-NLLB (encoder-decoder); a sketch contrasting the two framings follows this list.
- OMT models match or exceed the performance of a 70B LLM baseline, showing specialization advantages.
- OMT-LLaMA models enable coherent generation for many undersupported languages.
- OMT models improve cross-lingual transfer, coming close to solving the source-side 'understanding' half of MT across all 1,600 languages.
- Finetuning and retrieval-augmented generation further enhance quality for specific languages (a retrieval sketch follows this list).
- Leaderboard and evaluation datasets (BOUQuET, Met-BOUQuET) are freely available and evolving.
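
As a rough illustration of the backtranslation step in the data strategy: monolingual text in an undersupported target language is machine-translated back into a high-resource pivot language, producing synthetic parallel pairs whose target side is authentic human text. A minimal sketch, with `translate` as a hypothetical stand-in for whatever reverse-direction model the actual pipeline uses:

```python
# Backtranslation sketch. `translate` is a hypothetical placeholder for an
# existing reverse-direction MT model; the real pipeline's models and
# filtering steps are not reproduced here.

def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: a real pipeline would call a pretrained MT model here.
    return f"[{tgt}] {text}"

def backtranslate(monolingual_target: list[str], tgt_lang: str, pivot: str = "eng"):
    """Turn monolingual target-language sentences into synthetic parallel pairs.

    Each pair keeps the human-written sentence on the target side and puts
    the machine-generated translation on the source side, so a model trained
    on these pairs learns to emit fluent target-language text.
    """
    pairs = []
    for sentence in monolingual_target:
        synthetic_source = translate(sentence, src=tgt_lang, tgt=pivot)
        pairs.append((synthetic_source, sentence))
    return pairs

print(backtranslate(["Habari ya asubuhi."], tgt_lang="swh"))
```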
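To make the decoder-only vs. encoder-decoder contrast concrete, here is a hedged sketch using the Hugging Face transformers API. The OMT checkpoints are not assumed to be public, so the encoder-decoder path stands in with the released NLLB-200 checkpoint, and "my-org/omt-llama" is a purely hypothetical model id for the decoder-only path:

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# --- Encoder-decoder (NLLB-style): translation is a seq2seq task. ---
# Public NLLB-200 checkpoint used as a stand-in for OMT-NLLB.
enc_dec_tok = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
inputs = enc_dec_tok("The weather is nice today.", return_tensors="pt")
out = enc_dec.generate(
    **inputs,
    # The target language is selected by forcing the decoder's first token.
    forced_bos_token_id=enc_dec_tok.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=64,
)
print(enc_dec_tok.decode(out[0], skip_special_tokens=True))

# --- Decoder-only (LLaMA-style): translation framed as prompted generation. ---
# "my-org/omt-llama" is hypothetical; substitute any causal LM checkpoint.
dec_tok = AutoTokenizer.from_pretrained("my-org/omt-llama")
dec = AutoModelForCausalLM.from_pretrained("my-org/omt-llama")
prompt = (
    "Translate from English to French.\n"
    "English: The weather is nice today.\n"
    "French:"
)
ids = dec_tok(prompt, return_tensors="pt")
gen = dec.generate(**ids, max_new_tokens=64)
# Decode only the newly generated tokens, not the echoed prompt.
print(dec_tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```

The design difference is visible in how the target language is specified: the encoder-decoder model takes it as a forced decoder start token, while the decoder-only model infers it from the instruction in the prompt.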
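Retrieval-augmented generation for a specific language can be approximated with few-shot prompting over a translation memory: fetch the closest bilingual examples and prepend them to the prompt. The sketch below uses stdlib string similarity purely as a stand-in for the embedding-based retrieval a real system would use, and the memory contents are illustrative:

```python
from difflib import SequenceMatcher

# Toy translation memory: (source, target) pairs for an illustrative pair.
MEMORY = [
    ("Good morning.", "Bonjour."),
    ("How much does this cost?", "Combien ça coûte ?"),
    ("Where is the train station?", "Où est la gare ?"),
]

def retrieve(query: str, k: int = 2):
    """Return the k memory entries most similar to the query.

    SequenceMatcher is a stdlib stand-in; a real system would rank by
    embedding similarity over a much larger mined corpus.
    """
    scored = sorted(
        MEMORY,
        key=lambda pair: SequenceMatcher(None, query, pair[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(source: str) -> str:
    """Assemble a few-shot translation prompt from retrieved examples."""
    shots = "\n".join(f"English: {s}\nFrench: {t}" for s, t in retrieve(source))
    return f"{shots}\nEnglish: {source}\nFrench:"

# The assembled prompt is then fed to the translation model.
print(build_prompt("Where is the bus station?"))
```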