Hasty Briefs

The Common Pile v0.1: An 8TB dataset of public domain and openly licensed text

  • #open-source
  • #LLM
  • #dataset
  • The Common Pile v0.1 is an 8TB dataset of openly licensed text for LLM pretraining.
  • It addresses ethical concerns by avoiding unlicensed text, unlike many existing LLM training datasets.
  • The dataset draws on 30 diverse sources, including research papers, books, and educational materials.
  • Two 7B-parameter LLMs (Comma v0.1-1T and Comma v0.1-2T) were trained on the dataset and performed competitively with comparable models.
  • The release includes the dataset itself, the code used to create it, the training data mixtures, and model checkpoints.