Hasty Briefs

Meta's Omnilingual MT for 1,600 Languages

3 days ago
  • #multilingual-models
  • #machine-translation
  • #language-technology
  • No Language Left Behind (NLLB) demonstrated high-quality machine translation (MT) scaling to 200 languages.
  • Large Language Models (LLMs) improved MT quality but did not extend language coverage significantly.
  • Current MT systems struggle with limited coverage and generation bottlenecks, leaving many languages unsupported.
  • Omnilingual Machine Translation (OMT) is the first system of its kind, supporting over 1,600 languages.
  • OMT uses a comprehensive data strategy combining manually curated datasets, synthetic backtranslation, and mined parallel data.
  • Evaluation relies on BLASER 3, OmniTOX, BOUQuET, and Met-BOUQuET for reliable, broad-coverage assessment.
  • Two specialized LLM architectures are explored for MT: OMT-LLaMA (decoder-only) and OMT-NLLB (encoder-decoder).
  • OMT models match or exceed the performance of a 70B LLM baseline, showing specialization advantages.
  • OMT-LLaMA models enable coherent generation for many undersupported languages.
  • OMT models improve cross-lingual transfer, coming close to solving the 'understanding' (source-side) half of MT for all 1,600 languages.
  • Finetuning and retrieval-augmented generation further enhance quality for specific languages.
  • Leaderboard and evaluation datasets (BOUQuET, Met-BOUQuET) are freely available and evolving.
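To make the data strategy above more concrete, here is a minimal sketch of synthetic backtranslation: monolingual text in the target language is translated "backwards" into the source language by a reverse model, and the resulting pairs train the forward model on genuine target-side text. The `reverse_translate` stub below is purely illustrative (it just reverses word order); it stands in for a real target-to-source MT model, and none of these names come from the paper.

```python
def reverse_translate(sentence: str) -> str:
    """Stub for a target->source MT model; here it merely reverses word order."""
    return " ".join(reversed(sentence.split()))

def backtranslate(monolingual_target: list[str]) -> list[tuple[str, str]]:
    """Pair each real target sentence with a synthetic source sentence,
    yielding (synthetic_source, target) training pairs for the forward model."""
    return [(reverse_translate(t), t) for t in monolingual_target]

# The target side of each pair is real text, so the forward model learns
# to generate fluent output even when parallel data is scarce.
pairs = backtranslate(["the cat sat", "hello world"])
```

The key design point is that only the target side needs to be authentic; noisy synthetic sources are tolerable because the model is trained to produce, not consume, the target language.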
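The retrieval-augmented generation mentioned above can be sketched as prompting with similar examples from a translation memory. This is a hedged illustration under simple assumptions, not the paper's method: the in-memory example store and the word-overlap (Jaccard) scorer below are stand-ins for a real retriever.

```python
def overlap(a: str, b: str) -> float:
    """Jaccard word overlap between two sentences (toy retrieval score)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def build_prompt(source: str, memory: list[tuple[str, str]]) -> str:
    """Retrieve the most similar stored (source, translation) pair and
    prepend it to the prompt as an in-context example."""
    best_src, best_tgt = max(memory, key=lambda ex: overlap(source, ex[0]))
    return (f"Example: {best_src} -> {best_tgt}\n"
            f"Translate: {source} ->")

memory = [("good morning", "buenos días"), ("thank you", "gracias")]
prompt = build_prompt("good morning friend", memory)
```

Feeding such prompts to a translation LLM lets retrieved examples steer terminology and style for a specific language without any further training.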