I analyzed 571M Amazon reviews to find the most profanity-filled customer rants
6 hours ago
- #data-processing
- #amazon-reviews
- #content-analysis
- Processed 275 GB of Amazon reviews across 34 categories using a Burla cluster with 1,000 parallel workers.
- Ranked reviews by seven types of 'unhinged' content including profanity, screaming, punctuation bombs, and rants.
- Classified reviews with rule-based methods only, no LLMs: word lists plus simple metrics such as caps ratio and review length.
- Conducted three map-reduce passes to refine results, focusing on hard profanity and censor-aware lexicons.
- Filtered out false positives caused by proper nouns and idioms, and prioritized angry product rants in the final corpus.
- Provided an interactive UI with Unhinged Mode to toggle between raw content (with auto-redacted slurs) and sanitized views.
- Released the pipeline as open source on GitHub; it can be reproduced on any Burla cluster in about 15 minutes.
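
To make the rule-based approach above concrete, here is a minimal sketch of how caps ratio, punctuation bombs, and a censor-aware lexicon might feed a map-reduce top-k ranking. It is not the project's actual code: the two-word lexicon, the censor character set, the scoring weights, and every function name are hypothetical placeholders, and the map step runs locally rather than on a Burla cluster.

```python
import heapq
import re
from itertools import chain

# Hypothetical stand-in lexicon; the pipeline's real word lists are not reproduced here.
LEXICON = {"darn", "heck"}
CENSOR_CHARS = "*@#$%!"  # assumed set of characters people use to self-censor


def censor_aware_pattern(word):
    """Match `word` as written, or with any letter after the first masked (d*rn)."""
    mask = re.escape(CENSOR_CHARS)
    tail = "".join(f"[{re.escape(c)}{mask}]" for c in word[1:])
    # Lookarounds instead of \b, because a masked word may end in a non-word character.
    return rf"(?<!\w){re.escape(word[0])}{tail}(?!\w)"


PROFANITY_RE = re.compile(
    "|".join(censor_aware_pattern(w) for w in sorted(LEXICON)), re.IGNORECASE
)


def caps_ratio(text):
    """Fraction of alphabetic characters that are uppercase (the 'screaming' signal)."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def longest_punct_run(text):
    """Length of the longest run of '!' or '?' (the 'punctuation bomb' signal)."""
    return max((len(run) for run in re.findall(r"[!?]+", text)), default=0)


def score(text):
    """Combine the signals; the weights are illustrative, not tuned values."""
    hits = len(PROFANITY_RE.findall(text))
    return 3.0 * hits + 2.0 * caps_ratio(text) + 0.5 * min(longest_punct_run(text), 10)


def map_chunk(reviews, k):
    """Map step: each worker keeps only its own k highest-scoring reviews."""
    return heapq.nlargest(k, ((score(t), t) for t in reviews))


def reduce_partials(partials, k):
    """Reduce step: merge the per-chunk winners into a global top k."""
    return heapq.nlargest(k, chain.from_iterable(partials))


def top_k(chunks, k):
    return reduce_partials((map_chunk(chunk, k) for chunk in chunks), k)


if __name__ == "__main__":
    chunks = [["This is fine.", "D*RN THIS THING!!!"], ["what the heck", "ok"]]
    for s, text in top_k(chunks, 2):
        print(f"{s:.1f}  {text}")
```

Because each map call returns at most k candidates, the reduce step only ever merges k items per chunk, which is what keeps a ranking like this cheap even when the input is split across many workers.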