Norway's 2 petabytes of Huawei flash storage and LLM training
3 hours ago
- #Huawei Storage
- #Norwegian LLM
- #Sovereign AI
- Norway's National Library is developing a sovereign LLM for the Norwegian language, addressing cultural preservation.
- The project uses 2 PB of Huawei OceanStor Dorado flash storage in AI training pipelines.
- No commercial LLM provider was developing a Norwegian language model, highlighting the need for local solutions.
- The library has a legal mandate to collect and digitize Norway's cultural heritage, including books, newspapers, and web content.
- An agreement with newspapers allows LLM training on copyrighted content, which private companies lack.
- Digitization since 2005 has resulted in 20 PB of unique data stored in a 3-2-1 format.
- Key bottlenecks include data quality, cleaning, and pipeline throughput, not compute resources.
- Data processing uses an Nvidia DGX H200 system, CPU clusters, and Huawei flash arrays before training on a national supercomputer.
- Challenges involve moving PB-scale datasets from archival to high-throughput AI storage systems.
- Issues like evaluation tools, governance, and orchestration of multiple systems are ongoing learning points.
- Huawei storage is significant in the European market, and countries developing sovereign LLMs can learn from Norway's experience.