The Common Pile v0.1: An 8TB dataset of public domain and openly licensed text
- #open-source
- #LLM
- #dataset
- The Common Pile v0.1 is an 8TB dataset of openly licensed text for LLM pretraining.
- It addresses the ethical concerns raised by training on unlicensed text, which many existing LLM training datasets include.
- The dataset draws on 30 diverse sources, including research papers, books, and educational materials.
- Two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T) were trained on the dataset and perform competitively with comparably sized models trained on unlicensed text.
- The release includes the dataset itself, the code used to create it, the training data mixtures, and model checkpoints.