
Training a trillion-parameter model to be funny

11 days ago
  • #AI Training
  • #Reinforcement Learning
  • #Comedy Generation
  • Training a model toward a qualitative goal like comedy involves decomposing 'funny' into verifiable properties such as relevance, recency, and deep understanding of the subject.
  • Moonshot used rubric-based RL to improve their model's creative writing by breaking down 'good writing' into specific rubrics like clarity, engagement, and tone.
  • A scraper was built on Modal to collect comedy data from TikTok, Reddit, and university humor blogs, with Whisper large-v3 used to transcribe audio clips (a minimal sketch follows this list).
  • For RL training, each example was evaluated by a grader model (Qwen3-30B) against seven rubrics, with the reward computed as a weighted sum of the rubric scores (see the sketch after this list).
  • Challenges included reward hacking, with the model learning to append laughing emojis for higher grader scores, and synthetic preference pairs that failed to capture genuine humor (a possible guard is sketched after this list).
  • Successful training required iterative refinement of the rubrics and the data mix, converging on a blend of general, funny, and recently relevant comedy bits.
  • The resulting models, jokegen2-1t-rl and jokegen2-1t-sft, along with training code and rubrics, are available for experimentation.
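
The brief only states that the scraper ran on Modal and that Whisper large-v3 handled transcription, so the following is a minimal sketch of what that transcription step could look like; the app name, image setup, and function signature are all assumptions.

```python
# Hypothetical sketch of the Modal transcription step. Only the broad
# shape (a Modal function running Whisper large-v3) comes from the
# brief; every name and parameter here is an assumption.
import modal

app = modal.App("comedy-scraper")  # hypothetical app name

image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")          # Whisper decodes audio via ffmpeg
    .pip_install("openai-whisper")
)

@app.function(image=image, gpu="A10G", timeout=600)
def transcribe(clip: bytes) -> str:
    """Transcribe one scraped clip with Whisper large-v3."""
    import tempfile
    import whisper

    model = whisper.load_model("large-v3")
    with tempfile.NamedTemporaryFile(suffix=".mp4") as f:
        f.write(clip)
        f.flush()
        result = model.transcribe(f.name)
    return result["text"]
```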
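If "rewards based on weighted scores" means a weighted sum of per-rubric grader scores, the arithmetic looks roughly like the sketch below. The seven rubric names and weights are invented for illustration; the brief confirms only the count, the grader (Qwen3-30B), and the weighting.

```python
# Illustrative reward shaping: the grader scores a joke 0-10 on each
# rubric, and the reward is the weighted mean normalized to [0, 1].
# These rubric names and weights are guesses, not the article's list.
RUBRICS = {
    "relevance": 0.20,
    "recency": 0.15,
    "subject_understanding": 0.20,
    "originality": 0.15,
    "timing_and_structure": 0.10,
    "clarity": 0.10,
    "tone": 0.10,
}
assert abs(sum(RUBRICS.values()) - 1.0) < 1e-9

def reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-rubric grader scores, scaled to [0, 1]."""
    return sum(RUBRICS[name] * scores[name] for name in RUBRICS) / 10.0

scores = {"relevance": 8, "recency": 6, "subject_understanding": 9,
          "originality": 7, "timing_and_structure": 5, "clarity": 8,
          "tone": 7}
print(reward(scores))  # 0.735
```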
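The emoji exploit is classic reward hacking: the policy finds a surface feature the grader over-credits. One plausible guard, not described in the brief (the authors iterated on the rubrics instead), is to strip or penalize laughing emoji before grading:

```python
import re

# Common "laughing" emoji: the emoticons block plus U+1F923 (ROFL).
LAUGH_EMOJI = re.compile("[\U0001F600-\U0001F64F\U0001F923]")

def sanitize_for_grading(text: str) -> str:
    """Strip laughing emoji so the grader never sees them."""
    return LAUGH_EMOJI.sub("", text)

def penalized_reward(text: str, base_reward: float,
                     penalty: float = 0.05) -> float:
    """Alternatively, dock reward per emoji the policy appended.

    Hypothetical mitigation; the brief reports the exploit, not a fix.
    """
    return max(0.0, base_reward - penalty * len(LAUGH_EMOJI.findall(text)))
```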