Show HN: Hacker News archive (47M+ items, 11.6GB) as Parquet, updated every 5m
a day ago
- #dataset
- #technology
- #hacker-news
- The dataset contains the complete Hacker News archive from 2006 to the present, updated every 5 minutes.
- It includes every story, comment, Ask HN, Show HN, job posting, and poll ever submitted to the site.
- The data is organized as one Parquet file per calendar month, with live updates stored in 5-minute blocks.
- Hacker News is a long-running and influential technology community operated by Y Combinator since 2007.
- The dataset is suited to research and model training, for example trend analysis and language model pretraining.
- Common metrics computed from it include story scores, most-shared domains, and most active submitters.
- The dataset is sourced from the ClickHouse Playground, which mirrors the official HN Firebase API.
- Data fields include item ID, type, author, timestamp, text, URL, score, and more.
- The dataset is released under the Open Data Commons Attribution License (ODC-By) v1.0.
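
As a rough illustration of the metrics mentioned above (story scores, most-shared domains, most active submitters), here is a minimal Python sketch. It runs over a handful of made-up rows instead of the real Parquet files. The field names (`id`, `type`, `by`, `time`, `text`, `url`, `score`) are assumptions modeled on the field list in the summary and may not match the dataset's actual column names; the helper functions are hypothetical too.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up sample rows. Field names are assumed, not taken from the dataset schema.
SAMPLE_ITEMS = [
    {"id": 1, "type": "story", "by": "alice", "time": 1700000000,
     "url": "https://example.com/a", "score": 120, "text": None},
    {"id": 2, "type": "story", "by": "bob", "time": 1700000300,
     "url": "https://www.example.com/b", "score": 45, "text": None},
    {"id": 3, "type": "comment", "by": "alice", "time": 1700000600,
     "url": None, "score": None, "text": "Nice."},
    {"id": 4, "type": "story", "by": "alice", "time": 1700000900,
     "url": "https://example.org/c", "score": 10, "text": None},
]


def domain_of(url):
    """Return the host of a URL with a leading 'www.' removed ('' if none)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def stories(items):
    """Keep only rows whose type is 'story'."""
    return [it for it in items if it.get("type") == "story"]


def top_domains(items, n=10):
    """Count stories per linked domain, most frequent first."""
    counts = Counter()
    for it in stories(items):
        if it.get("url"):
            domain = domain_of(it["url"])
            if domain:  # skip URLs without a parseable host
                counts[domain] += 1
    return counts.most_common(n)


def top_submitters(items, n=10):
    """Count stories per author, most frequent first."""
    counts = Counter(it["by"] for it in stories(items) if it.get("by"))
    return counts.most_common(n)


def mean_story_score(items):
    """Average score over stories that have a score, or None if there are none."""
    scores = [it["score"] for it in stories(items) if it.get("score") is not None]
    return sum(scores) / len(scores) if scores else None


if __name__ == "__main__":
    print(top_domains(SAMPLE_ITEMS))
    print(top_submitters(SAMPLE_ITEMS))
    print(mean_story_score(SAMPLE_ITEMS))
```

With the real data, the same functions could be fed rows loaded from one of the monthly Parquet files by whatever Parquet reader is available; at 47M+ items, a columnar query engine would likely be a better fit than plain Python lists.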