Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs

a year ago
  • #LLM
  • #multilingual
  • #European Union
  • Introduction of Teuken-7B-Base and Teuken-7B-Instruct, multilingual LLMs supporting all 24 official EU languages.
  • Models trained on ~60% non-English data with a custom multilingual tokenizer to address English-centric LLM limitations.
  • The development principles are described in detail, covering data composition, tokenizer optimization, and training methodology.
  • Competitive performance demonstrated on European versions of ARC, HellaSwag, MMLU, and TruthfulQA benchmarks.
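
A common way to check whether a tokenizer is English-centric is to compare its "fertility", the average number of tokens produced per word, across languages. The sketch below is illustrative only and is not taken from the summary: `toy_chunk_tokenizer` is a hypothetical stand-in for a real tokenizer, the sample sentences are invented, and the Teuken tokenizer may have been evaluated differently.

```python
from typing import Callable, Dict, List


def fertility(tokenize: Callable[[str], List[str]], texts: List[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


def toy_chunk_tokenizer(text: str, size: int = 4) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of
    at most `size` characters. A real subword tokenizer would go here."""
    return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]


# Invented sample sentences, one per language, for illustration only.
samples: Dict[str, List[str]] = {
    "en": ["the cat sat on the mat"],
    "de": ["die Katze saß auf der Fußmatte"],
}

scores = {lang: fertility(toy_chunk_tokenizer, texts) for lang, texts in samples.items()}

for lang, score in scores.items():
    print(f"{lang}: {score:.2f} tokens per word")
```

A lower score means fewer tokens are needed for the same text. A large gap between English and other languages suggests the vocabulary favors English.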