Teuken-7B-Base and Teuken-7B-Instruct: Towards European LLMs
a year ago
- #LLM
- #multilingual
- #European Union
- Introduces Teuken-7B-Base and Teuken-7B-Instruct, two multilingual LLMs that support all 24 official EU languages.
- Models trained on ~60% non-English data with a custom multilingual tokenizer to address English-centric LLM limitations.
- Describes the development principles behind the models: data composition, tokenizer optimization, and training methodology.
- Competitive performance demonstrated on European versions of ARC, HellaSwag, MMLU, and TruthfulQA benchmarks.
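
To make the tokenizer point concrete: one common way to measure how well a tokenizer fits a language is its fertility, the average number of tokens produced per word. The summary does not say which metric the authors used, so the sketch below is only an illustration. The greedy longest-match tokenizer, both toy vocabularies, and the sample sentence are hypothetical and have nothing to do with the actual Teuken tokenizer.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Split each whitespace-separated word by greedy longest match.

    Characters not covered by any vocabulary entry fall back to
    single-character tokens. This is a toy stand-in for a real
    subword tokenizer, not the Teuken tokenizer.
    """
    tokens: List[str] = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest vocabulary entry that starts at position i.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No entry matched: emit a single character.
                tokens.append(word[i])
                i += 1
    return tokens


def fertility(text: str, vocab: Set[str]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(greedy_tokenize(text, vocab)) / len(words)


# Hypothetical vocabularies: one English-only, one with German subwords added.
english_vocab = {"the", "model", "language", "s", "ing", "train"}
multilingual_vocab = english_vocab | {"das", "sprach", "modell", "trainier", "en"}

german = "das sprachmodell trainieren"

print("English-centric vocab:", fertility(german, english_vocab))
print("Multilingual vocab:   ", fertility(german, multilingual_vocab))
```

With the English-only vocabulary, the German words break into many single-character tokens. The vocabulary that includes German subwords covers the same text with far fewer tokens. Lower fertility means shorter sequences for the same text, which is the kind of inefficiency a custom multilingual tokenizer is meant to reduce.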