Norway's 2 petabytes of Huawei flash storage and LLM training

3 hours ago

Norway's National Library is developing a sovereign LLM for the Norwegian language as part of its cultural preservation work.
The project uses 2 PB of Huawei OceanStor Dorado flash storage in AI training pipelines.
No commercial LLM provider was developing a Norwegian language model, which made a local solution necessary.
The library has a legal mandate to collect and digitize Norway's cultural heritage, including books, newspapers, and web content.
An agreement with newspapers allows the library to train LLMs on copyrighted content, an arrangement that private companies lack.
Digitization since 2005 has produced 20 PB of unique data, stored under a 3-2-1 backup scheme (three copies, on two types of media, with one copy off-site).
The key bottlenecks are data quality, cleaning, and pipeline throughput rather than compute resources.
Data processing uses an Nvidia DGX H200 system, CPU clusters, and Huawei flash arrays before training on a national supercomputer.
A further challenge is moving PB-scale datasets from archival storage to high-throughput AI storage systems.
Evaluation tooling, governance, and orchestration across multiple systems remain areas where the team is still learning.
Huawei storage has a significant presence in the European market, and other countries developing sovereign LLMs can learn from Norway's experience.
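
To give a sense of scale for the storage and data-movement points above, here is a rough back-of-the-envelope sketch in Python. It is an illustration only, not the library's tooling. The 20 PB and 2 PB figures come from the summary; the three-copy multiplier assumes "3-2-1" refers to the usual three-copy backup rule, and the throughput values are hypothetical placeholders, since the summary gives no transfer rates.

```python
# Back-of-the-envelope arithmetic for PB-scale storage and data movement.
# Assumptions (not from the source): decimal units (1 PB = 10**15 bytes),
# three full copies under a 3-2-1 scheme, and hypothetical sustained rates.

PB = 10**15   # bytes in a decimal petabyte
GB = 10**9    # bytes in a decimal gigabyte
SECONDS_PER_DAY = 86_400


def raw_footprint_pb(unique_pb: float, copies: int = 3) -> float:
    """Total capacity needed when every byte is kept `copies` times."""
    return unique_pb * copies


def transfer_days(dataset_pb: float, throughput_gb_per_s: float) -> float:
    """Days needed to move a dataset at a given sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = dataset_pb * PB / (throughput_gb_per_s * GB)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # 20 PB of unique data, kept in three copies.
    print(f"Raw footprint: {raw_footprint_pb(20):.0f} PB")

    # Filling a 2 PB flash tier from the archive at hypothetical rates.
    for rate in (1, 10, 50):
        days = transfer_days(2, rate)
        print(f"2 PB at {rate:>2} GB/s sustained: {days:.1f} days")
```

Under these assumptions, filling the 2 PB flash tier at a sustained 10 GB/s takes a little over two days, and at 1 GB/s a little over three weeks, which is consistent with the summary's point that pipeline throughput, rather than compute, is the constraint.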
