Meta's Omnilingual MT for 1,600 Languages
- #multilingual-models
- #machine-translation
- #language-technology
- No Language Left Behind (NLLB) demonstrated that high-quality machine translation (MT) can scale to 200 languages.
- Large Language Models (LLMs) improved MT quality but did not extend language coverage significantly.
- Current MT systems struggle with limited coverage and generation bottlenecks, leaving many languages unsupported.
- Omnilingual Machine Translation (OMT) supports over 1,600 languages, making it the first MT system to reach that scale.
- OMT uses a comprehensive data strategy combining manually curated datasets, synthetic backtranslation, and large-scale mining (see the backtranslation sketch after this list).
- Evaluation includes BLASER 3, OmniTOX, BOUQuET, and Met-BOUQuET for reliable and expansive assessment.
- Two specialized LLM approaches for MT: OMT-LLaMA (decoder-only) and OMT-NLLB (encoder-decoder); a sketch contrasting the two framings follows this list.
- OMT models match or exceed the performance of a 70B LLM baseline, showing specialization advantages.
- OMT-LLaMA models enable coherent generation for many undersupported languages.
- OMT models improve cross-lingual transfer, coming close to solving the source-side 'understanding' half of MT across all 1,600 languages.
- Finetuning and retrieval-augmented generation further enhance quality for specific languages (a retrieval sketch follows this list).
- Leaderboard and evaluation datasets (BOUQuET, Met-BOUQuET) are freely available and evolving.
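
As a rough illustration of the backtranslation step in the data strategy: monolingual text in an undersupported target language is machine-translated back into a high-resource pivot language, producing synthetic parallel pairs whose target side is authentic human text. A minimal sketch, with `translate` as a hypothetical stand-in for whatever reverse-direction model the actual pipeline uses:

```python
# Backtranslation sketch. `translate` is a hypothetical placeholder for an
# existing reverse-direction MT model; the real pipeline's models and
# filtering steps are not reproduced here.

def translate(text: str, src: str, tgt: str) -> str:
    # Placeholder: a real pipeline would call a pretrained MT model here.
    return f"[{tgt}] {text}"

def backtranslate(monolingual_target: list[str], tgt_lang: str, pivot: str = "eng"):
    """Turn monolingual target-language sentences into synthetic parallel pairs.

    Each pair keeps the human-written sentence on the target side and puts
    the machine-generated translation on the source side, so a model trained
    on these pairs learns to emit fluent target-language text.
    """
    pairs = []
    for sentence in monolingual_target:
        synthetic_source = translate(sentence, src=tgt_lang, tgt=pivot)
        pairs.append((synthetic_source, sentence))
    return pairs

print(backtranslate(["Habari ya asubuhi."], tgt_lang="swh"))
```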
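To make the decoder-only vs. encoder-decoder contrast concrete, here is a hedged sketch using the Hugging Face transformers API. The OMT checkpoints are not assumed to be public, so the encoder-decoder path stands in with the released NLLB-200 checkpoint, and "my-org/omt-llama" is a purely hypothetical model id for the decoder-only path:

```python
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

# --- Encoder-decoder (NLLB-style): translation is a seq2seq task. ---
# Public NLLB-200 checkpoint used as a stand-in for OMT-NLLB.
enc_dec_tok = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
enc_dec = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
inputs = enc_dec_tok("The weather is nice today.", return_tensors="pt")
out = enc_dec.generate(
    **inputs,
    # The target language is selected by forcing the decoder's first token.
    forced_bos_token_id=enc_dec_tok.convert_tokens_to_ids("fra_Latn"),
    max_new_tokens=64,
)
print(enc_dec_tok.decode(out[0], skip_special_tokens=True))

# --- Decoder-only (LLaMA-style): translation framed as prompted generation. ---
# "my-org/omt-llama" is hypothetical; substitute any causal LM checkpoint.
dec_tok = AutoTokenizer.from_pretrained("my-org/omt-llama")
dec = AutoModelForCausalLM.from_pretrained("my-org/omt-llama")
prompt = (
    "Translate from English to French.\n"
    "English: The weather is nice today.\n"
    "French:"
)
ids = dec_tok(prompt, return_tensors="pt")
gen = dec.generate(**ids, max_new_tokens=64)
# Decode only the newly generated tokens, not the echoed prompt.
print(dec_tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
```

The design difference is visible in how the target language is specified: the encoder-decoder model takes it as a forced decoder start token, while the decoder-only model infers it from the instruction in the prompt.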
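Retrieval-augmented generation for a specific language can be approximated with few-shot prompting over a translation memory: fetch the closest bilingual examples and prepend them to the prompt. The sketch below uses stdlib string similarity purely as a stand-in for the embedding-based retrieval a real system would use, and the memory contents are illustrative:

```python
from difflib import SequenceMatcher

# Toy translation memory: (source, target) pairs for an illustrative pair.
MEMORY = [
    ("Good morning.", "Bonjour."),
    ("How much does this cost?", "Combien ça coûte ?"),
    ("Where is the train station?", "Où est la gare ?"),
]

def retrieve(query: str, k: int = 2):
    """Return the k memory entries most similar to the query.

    SequenceMatcher is a stdlib stand-in; a real system would rank by
    embedding similarity over a much larger mined corpus.
    """
    scored = sorted(
        MEMORY,
        key=lambda pair: SequenceMatcher(None, query, pair[0]).ratio(),
        reverse=True,
    )
    return scored[:k]

def build_prompt(source: str) -> str:
    """Assemble a few-shot translation prompt from retrieved examples."""
    shots = "\n".join(f"English: {s}\nFrench: {t}" for s, t in retrieve(source))
    return f"{shots}\nEnglish: {source}\nFrench:"

# The assembled prompt is then fed to the translation model.
print(build_prompt("Where is the bus station?"))
```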