Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m

a day ago
  • #dataset
  • #technology
  • #hacker-news
  • The dataset contains the complete Hacker News archive from 2006 to the present, updated every 5 minutes.
  • It includes every story, comment, Ask HN, Show HN, job posting, and poll ever submitted to the site.
  • The data is organized as one Parquet file per calendar month, with live updates stored in 5-minute blocks.
  • Hacker News is a long-running and influential technology community operated by Y Combinator since 2007.
  • The dataset is useful for research, analysis, and training, including language model pretraining and trend analysis.
  • Common analyses include story scores, most-shared domains, and most-active submitters.
  • The dataset is sourced from the ClickHouse Playground, which mirrors the official HN Firebase API.
  • Data fields include item ID, type, author, timestamp, text, URL, score, and more.
  • The dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0.
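
As a rough illustration of the most-shared-domains and most-active-submitters analyses mentioned above, here is a minimal Python sketch. It runs on a few made-up rows rather than the real Parquet files. The field names (`id`, `type`, `by`, `time`, `url`, `score`) are assumptions modeled on the fields listed in the summary and may not match the dataset's actual column names. The helpers `domain_of`, `top_domains`, and `top_submitters` are hypothetical names introduced only for this example.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample rows. The keys are assumed field names, not the
# dataset's confirmed schema; the values are invented for illustration.
SAMPLE_ITEMS = [
    {"id": 1, "type": "story", "by": "alice", "time": 1700000000,
     "url": "https://example.com/a", "score": 120},
    {"id": 2, "type": "story", "by": "bob", "time": 1700000300,
     "url": "https://www.example.com/b", "score": 45},
    {"id": 3, "type": "comment", "by": "alice", "time": 1700000600,
     "url": None, "score": None},
    {"id": 4, "type": "story", "by": "alice", "time": 1700000900,
     "url": "https://example.org/c", "score": 10},
]


def domain_of(url):
    """Return the lowercased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def top_domains(items, n=10):
    """Count link domains across stories that have a URL."""
    counts = Counter(
        domain_of(item["url"])
        for item in items
        if item["type"] == "story" and item.get("url")
    )
    return counts.most_common(n)


def top_submitters(items, n=10):
    """Count story submissions per author."""
    counts = Counter(
        item["by"]
        for item in items
        if item["type"] == "story" and item.get("by")
    )
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_domains(SAMPLE_ITEMS, 3))
    print(top_submitters(SAMPLE_ITEMS, 3))
```

On the full archive, the same counting would more likely be done with a columnar query engine reading the monthly Parquet files directly. The in-memory version here is only meant to make the two metrics concrete.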