I analyzed 571M Amazon reviews to find the most profanity-filled customer rants

6 hours ago

Processed 275 GB of Amazon reviews across 34 categories using a Burla cluster with 1,000 parallel workers.
Ranked reviews by seven types of 'unhinged' content including profanity, screaming, punctuation bombs, and rants.
Employed rule-based methods without LLMs, using word lists and metrics like caps-ratio and length for classification.
Conducted three map-reduce passes to refine results, focusing on hard profanity and censor-aware lexicons.
Filtered false positives from proper nouns and idioms, and prioritized angry product rants in the final corpus.
Provided an interactive UI with Unhinged Mode to toggle between raw content (with auto-redacted slurs) and sanitized views.
The open-source pipeline is available on GitHub and can be reproduced on any Burla cluster in about 15 minutes.
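
The rule-based scoring and map-reduce ranking summarized above might look roughly like the sketch below. It is an illustration only, not the author's pipeline: the word list, the regexes, and the function names (`score_review`, `map_chunk`, `reduce_top`) are all assumptions, and plain Python loops stand in for the Burla workers.

```python
import heapq
import re

# Hypothetical placeholder lexicon; the article's actual word lists are not given here.
PROFANITY = {"damn", "crap", "hell"}

WORD_RE = re.compile(r"[A-Za-z']+")
PUNCT_BOMB_RE = re.compile(r"[!?]{3,}")  # assumed rule: a run of 3+ '!' or '?' is one "bomb"


def caps_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are uppercase (0.0 if there are none)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def score_review(text: str) -> dict:
    """Compute simple rule-based metrics for one review, with no LLM involved."""
    words = WORD_RE.findall(text.lower())
    return {
        "profanity": sum(w in PROFANITY for w in words),
        "caps_ratio": caps_ratio(text),
        "punct_bombs": len(PUNCT_BOMB_RE.findall(text)),
        "length": len(words),
    }


def map_chunk(reviews, k):
    """Map step: score one chunk of reviews and keep its top k by profanity count."""
    scored = ((score_review(r)["profanity"], r) for r in reviews)
    return heapq.nlargest(k, scored)


def reduce_top(partials, k):
    """Reduce step: merge the per-chunk winners into a global top k."""
    merged = [item for part in partials for item in part]
    return heapq.nlargest(k, merged)


if __name__ == "__main__":
    chunks = [["fine product", "damn damn crap"], ["what the hell", "ok"]]
    partials = [map_chunk(chunk, k=1) for chunk in chunks]  # one call per worker
    print(reduce_top(partials, k=1))
```

Each worker only returns its own top k, so the reduce step merges a handful of candidates per chunk rather than every review. On a cluster, the `map_chunk` calls would be dispatched to the parallel workers instead of running in a list comprehension.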
