Hasty Briefsbeta

Bilingual

Norway's 2 petabytes of Huawei flash storage and LLM training

3 hours ago
  • #Huawei Storage
  • #Norwegian LLM
  • #Sovereign AI
  • Norway's National Library is developing a sovereign LLM for the Norwegian language, addressing cultural preservation.
  • The project uses 2 PB of Huawei OceanStor Dorado flash storage in AI training pipelines.
  • No commercial LLM provider was developing a Norwegian language model, highlighting the need for local solutions.
  • The library has a legal mandate to collect and digitize Norway's cultural heritage, including books, newspapers, and web content.
  • An agreement with newspapers allows LLM training on copyrighted content, which private companies lack.
  • Digitization since 2005 has resulted in 20 PB of unique data stored in a 3-2-1 format.
  • Key bottlenecks include data quality, cleaning, and pipeline throughput, not compute resources.
  • Data processing uses an Nvidia DGX H200 system, CPU clusters, and Huawei flash arrays before training on a national supercomputer.
  • Challenges involve moving PB-scale datasets from archival to high-throughput AI storage systems.
  • Issues like evaluation tools, governance, and orchestration of multiple systems are ongoing learning points.
  • Huawei storage is significant in the European market, and countries developing sovereign LLMs can learn from Norway's experience.