RepoRoulette: Randomly sample repositories from GitHub
a year ago
- #sampling
- #GitHub
- #open-source
- RepoRoulette provides four methods for random GitHub repository sampling: IDSampler, TemporalSampler, BigQuerySampler, and GHArchiveSampler.
- IDSampler draws random values from GitHub's sequential repository ID space, but it has a low hit rate because many IDs belong to private or deleted repositories.
- TemporalSampler selects repositories updated during random time periods within a specified date range.
- BigQuerySampler leverages Google BigQuery's public GitHub dataset for advanced filtering but requires a GCP account.
- GHArchiveSampler samples repositories from GitHub Archive, which records public GitHub events.
- Applications include academic research, learning resources, data science, trend analysis, and security research.
- The project is open for contributions and licensed under MIT.
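
The ID-based and time-window ideas in the summary above can be illustrated with a minimal sketch. This is not RepoRoulette's actual API: `sample_by_id`, `random_window`, and the `lookup` callable are hypothetical names introduced here, and `lookup` stands in for whatever call resolves a repository ID (returning `None` when the ID does not map to an accessible repository). The sketch only shows why the hit rate falls out naturally from probing random IDs, and how a random period inside a date range might be chosen.

```python
import random
from datetime import datetime, timedelta
from typing import Callable, List, Optional, Tuple


def sample_by_id(
    n: int,
    max_id: int,
    lookup: Callable[[int], Optional[dict]],  # hypothetical resolver, not a real API
    max_attempts: int = 10_000,
    seed: Optional[int] = None,
) -> Tuple[List[dict], float]:
    """Probe random IDs in [1, max_id] until n repositories are found.

    Returns the repositories found and the observed hit rate
    (hits divided by distinct IDs probed).
    """
    rng = random.Random(seed)
    # Never probe more distinct IDs than exist, so the loop always terminates.
    max_attempts = min(max_attempts, max_id)
    found: List[dict] = []
    seen = set()
    attempts = 0
    while len(found) < n and attempts < max_attempts:
        repo_id = rng.randint(1, max_id)
        if repo_id in seen:
            continue  # skip IDs that were already probed
        seen.add(repo_id)
        attempts += 1
        repo = lookup(repo_id)
        if repo is not None:
            found.append(repo)
    hit_rate = len(found) / attempts if attempts else 0.0
    return found, hit_rate


def random_window(
    start: datetime,
    end: datetime,
    window: timedelta,
    rng: random.Random,
) -> Tuple[datetime, datetime]:
    """Pick a random period of length `window` that lies inside [start, end]."""
    slack = (end - start) - window
    if slack < timedelta(0):
        raise ValueError("window is longer than the date range")
    # Choose the offset in whole seconds to keep the arithmetic exact.
    offset = timedelta(seconds=rng.randint(0, int(slack.total_seconds())))
    window_start = start + offset
    return window_start, window_start + window
```

In use, `lookup` would wrap a real repository lookup, and the window returned by `random_window` would be turned into a date filter on repository update time. Both steps are left out here because they depend on details the summary does not give.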