Training a trillion-parameter model to be funny
11 days ago
- #AI Training
- #Reinforcement Learning
- #Comedy Generation
- Training a model on a subjective target like comedy involves decomposing 'funny' into verifiable properties such as relevance, recency, and demonstrated understanding of the subject.
- Moonshot used rubric-based RL to improve their model's creative writing by breaking down 'good writing' into specific rubrics like clarity, engagement, and tone.
- A scraper was built on Modal to collect comedy clips from TikTok, Reddit, and university humor blogs, with Whisper large-v3 used to transcribe the audio (a transcription sketch follows this list).
- For RL training, each example was scored by a grader model (Qwen3-30B) against seven rubrics, with the reward computed as a weighted sum of the rubric scores (see the grading sketch below).
- Challenges included reward hacking, with the model learning to add laughing emojis to inflate its scores, and synthetic preference pairs that failed to capture genuine humor (a simple emoji guard is sketched below).
- Successful training required iterative refinement of rubrics and data mix, focusing on general, funny, and recently relevant comedy bits.
- The resulting models, jokegen2-1t-rl and jokegen2-1t-sft, along with training code and rubrics, are available for experimentation.
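
As a rough illustration of the data-collection step above, here is a minimal Modal sketch that transcribes already-downloaded clips with Whisper large-v3. The app name, GPU choice, file paths, and fan-out pattern are assumptions made for the example, not the post's actual scraper code.

```python
# Minimal sketch of a Modal transcription worker, assuming audio clips have
# already been scraped and downloaded. App name, GPU, and paths are
# hypothetical, not the post's actual code.
import modal

app = modal.App("comedy-scraper")

image = (
    modal.Image.debian_slim()
    .apt_install("ffmpeg")
    .pip_install("openai-whisper")
)

@app.function(image=image, gpu="A10G", timeout=600)
def transcribe(audio_bytes: bytes, clip_id: str) -> dict:
    """Transcribe one scraped clip with Whisper large-v3."""
    import tempfile
    import whisper

    # For a real pipeline the weights would be baked into the image; here
    # they are loaded on first call for simplicity.
    model = whisper.load_model("large-v3")
    with tempfile.NamedTemporaryFile(suffix=".mp3") as f:
        f.write(audio_bytes)
        f.flush()
        result = model.transcribe(f.name)
    return {"clip_id": clip_id, "text": result["text"]}

@app.local_entrypoint()
def main():
    # Fan clips out across containers with .starmap (hypothetical local paths).
    clips = {"clip-001": open("clips/clip-001.mp3", "rb").read()}
    for row in transcribe.starmap([(data, cid) for cid, data in clips.items()]):
        print(row["clip_id"], row["text"][:80])
```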
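
The rubric-based reward can be sketched as a weighted sum of per-rubric scores returned by the grader. The seven rubric names and weights below are placeholders (the summary does not list the actual ones), and the grader is assumed to be Qwen3-30B served behind an OpenAI-compatible endpoint such as vLLM.

```python
# Hedged sketch of rubric-based grading: a grader model scores a joke against
# weighted rubrics and the RL reward is the weighted sum of those scores.
# Rubric names and weights are illustrative placeholders.
import json
from openai import OpenAI

RUBRICS = {  # placeholder rubrics; weights sum to 1.0
    "relevance": 0.20,
    "recency": 0.10,
    "subject_understanding": 0.20,
    "surprise": 0.15,
    "brevity": 0.10,
    "originality": 0.15,
    "delivery": 0.10,
}

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def grade(joke: str, topic: str) -> float:
    """Return a scalar reward in [0, 1] from weighted rubric scores."""
    prompt = (
        "Score the joke below on each rubric from 0 to 10.\n"
        f"Rubrics: {', '.join(RUBRICS)}\n"
        f"Topic: {topic}\nJoke: {joke}\n"
        'Reply with JSON only, e.g. {"relevance": 7, ...}'
    )
    resp = client.chat.completions.create(
        model="Qwen3-30B",  # assumed grader deployment name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # A production grader would validate/retry on malformed JSON.
    scores = json.loads(resp.choices[0].message.content)
    return sum(w * scores.get(r, 0) / 10.0 for r, w in RUBRICS.items())
```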
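
The post notes the policy learned to append laughing emojis to game the grader. The exact mitigation is not described in this summary; one plausible guard is to discount the grader's reward when the output leans on emojis, as in the hypothetical helper below.

```python
# Sketch of one possible anti-reward-hacking guard (not the post's actual
# fix): penalize the grader reward per laughing emoji in the output.
import re

LAUGH_EMOJI = re.compile(r"[\U0001F600-\U0001F64F\U0001F923]")

def emoji_penalized_reward(joke: str, base_reward: float) -> float:
    """Discount the rubric reward by 0.2 per emoji, flooring at zero."""
    n = len(LAUGH_EMOJI.findall(joke))
    return base_reward if n == 0 else max(0.0, base_reward - 0.2 * n)
```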