Classifying aviation-related posts on Hacker News with SLMs
a year ago
- #machine-learning
- #data-analysis
- #aviation
- Hacker News has a surprisingly high volume of aviation-related content.
- The author used Small Language Models (SLMs) to classify 42 million Hacker News posts for aviation relevance.
- Data was gathered via the Hacker News API, processed, and stored in a Cloudflare R2 bucket.
- A pipeline was created to preprocess posts, concatenating each post's title and text into a single model input.
- Model selection and prompt prototyping were done on a 10,000-post subset for efficiency.
- The final analysis classified 0.62% of all posts and 1.13% of top stories as aviation-related.
- Aviation-related posts have increased over time, with spikes during major aviation incidents.
- The top 30 contributors to aviation content on Hacker News were acknowledged.
- Future improvements include more rigorous evaluations and advanced modeling techniques.
- The author highlights the effectiveness of small, pre-trained models for large-scale data analysis.
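
As a rough illustration of the preprocessing step summarized above (concatenating titles and texts for model input) and of how a share such as 0.62% would be computed from per-post labels, here is a minimal Python sketch. It is not the author's pipeline. The `title` and `text` field names follow the public Hacker News API item schema; the HTML stripping, the newline separator, and the helper names `build_model_input` and `share_percent` are assumptions made for illustration, and the classifier itself is not shown.

```python
import html
import re

# Matches HTML tags such as <p> or <i>; HN item text is delivered as HTML.
_TAG_RE = re.compile(r"<[^>]+>")


def build_model_input(item: dict) -> str:
    """Concatenate an item's title and text into one string for a classifier.

    Stories usually carry a title, comments usually carry only text,
    so either field may be missing (hypothetical helper, not from the source).
    """
    parts = []
    for field in ("title", "text"):
        value = item.get(field)
        if value:
            # Replace tags with spaces, decode entities, collapse whitespace.
            cleaned = html.unescape(_TAG_RE.sub(" ", value))
            cleaned = " ".join(cleaned.split())
            if cleaned:
                parts.append(cleaned)
    # Assumed separator: a single newline between title and text.
    return "\n".join(parts)


def share_percent(labels: list[bool]) -> float:
    """Percentage of items whose label is True (e.g. aviation-related)."""
    if not labels:
        return 0.0
    return 100.0 * sum(labels) / len(labels)
```

A function like `build_model_input` would be applied to each stored item before classification, and `share_percent` to the resulting boolean labels, once over all posts and once over top stories only.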