The Common Pile v0.1: An 8TB dataset of public domain and openly licensed text
- #open-source
- #LLM
- #dataset
- The Common Pile v0.1 is an 8TB dataset of openly licensed text for LLM pretraining.
- It addresses the ethical concerns raised by training on unlicensed text, which many existing LLM training datasets include.
- The dataset draws on 30 diverse sources, including research papers, books, and educational materials.
- Two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T) were trained on the dataset and perform competitively with comparably sized models trained on unlicensed text.
- The release includes the dataset itself, the code used to create it, the training data mixtures, and model checkpoints.