The End of the Train-Test Split
- #LLM-challenges
- #content-moderation
- #machine-learning
- Building a butt classification model at Facebook starts as a conventional supervised task: train a CNN image classifier and tune it for high precision and recall.
- The policy team then requests a more context-aware model for 'sexually suggestive' content; translating the policy's decision tree into LLM prompts proves difficult and accuracy drops.
- Label discrepancies and policy ambiguity compound the problem: outsourced labelers struggle with nuanced definitions like 'sexually suggestive'.
- Expert input is crucial for these nuanced tasks, but experts' limited availability makes large labeled datasets hard to build and maintain.
- LLMs are steered with clear natural-language rules and examples rather than traditional training sets, shifting the work from hyperparameter tuning to policy alignment (see the prompt sketch after this list).
- High error rates in 'golden sets' and disagreement among experts highlight the need for continuous feedback loops between the policy and engineering teams; a simple agreement check (sketched after this list) can surface these conflicts early.
- Traditional train-test splits break down for complex LLM tasks because labels are ambiguous and model explanations need expert review.
- Shadow-mode testing and direct communication between teams are essential for resolving edge cases and improving accuracy (a minimal shadow-mode wrapper is sketched after this list).
- LLMs excel at enforcing natural language rules but require rigorous policy alignment and ongoing evaluation to handle complex classifications.
- The future of LLMs in domains like law and content moderation depends on solving these alignment challenges and on models becoming better at recognizing their own errors.
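
Below is a minimal sketch of the prompt-as-policy approach from the list above. It assumes the OpenAI Python SDK; the `POLICY` string, model name, and label format are illustrative stand-ins, not anything from the original post.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical policy text: the "training set" is now rules plus worked examples.
POLICY = """You are a content-moderation classifier.
Label each post as VIOLATING or NON-VIOLATING under this rule:
sexually suggestive content is not allowed unless it is medical,
educational, or newsworthy.

Examples:
- "Buy swimwear, 50% off" -> NON-VIOLATING (commercial, not suggestive)
- "DM me for spicy pics" -> VIOLATING (solicits suggestive content)
"""

def classify(post: str) -> str:
    """Ask the model for a label plus a short rationale the policy team can audit."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        temperature=0,
        messages=[
            {"role": "system", "content": POLICY},
            {"role": "user", "content": f"Post: {post}\nGive the label, then a one-sentence rationale."},
        ],
    )
    return resp.choices[0].message.content

print(classify("Check my profile for something you won't forget ;)"))
```

Note that the auditable rationale, not just the label, is the output the policy team reviews; that is what replaces hyperparameter tuning as the alignment loop.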
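The golden-set point is easy to make concrete. The sketch below is my own illustration rather than anything from the post: it uses scikit-learn's `cohen_kappa_score` to measure how often two experts actually agree on the 'golden' labels and to pull out the items that need a policy ruling.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical golden-set labels from two policy experts (1 = violating, 0 = not).
posts = ["post_a", "post_b", "post_c", "post_d", "post_e"]
expert_1 = [1, 0, 1, 1, 0]
expert_2 = [1, 1, 1, 0, 0]

# Chance-corrected agreement; values well below ~0.8 suggest the policy
# itself is ambiguous, not just the labelers.
kappa = cohen_kappa_score(expert_1, expert_2)
print(f"Cohen's kappa: {kappa:.2f}")

# Disagreements go back to the policy team as candidate policy clarifications.
for post, a, b in zip(posts, expert_1, expert_2):
    if a != b:
        print(f"needs a ruling: {post} (expert_1={a}, expert_2={b})")
```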
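Finally, a shadow-mode harness can be as small as a wrapper that serves the production label while silently recording where the candidate model disagrees. Everything here (the classifier callables, the review queue) is a hypothetical stand-in for whatever the real pipeline uses.

```python
import logging
from typing import Callable

log = logging.getLogger("shadow")

def shadow_moderate(
    post: str,
    prod_classify: Callable[[str], str],
    candidate_classify: Callable[[str], str],
    review_queue: list,
) -> str:
    """Return the production label; run the candidate silently and queue disagreements."""
    prod_label = prod_classify(post)
    try:
        candidate_label = candidate_classify(post)
    except Exception:
        # The shadow model must never affect the user-facing decision.
        log.exception("candidate model failed; serving production label")
        return prod_label
    if candidate_label != prod_label:
        review_queue.append(
            {"post": post, "prod": prod_label, "candidate": candidate_label}
        )
    return prod_label

# Toy usage with stand-in classifiers.
queue: list = []
label = shadow_moderate(
    "DM me for spicy pics",
    prod_classify=lambda p: "NON-VIOLATING",
    candidate_classify=lambda p: "VIOLATING",
    review_queue=queue,
)
print(label, queue)  # production label is served; the disagreement is queued
```

The disagreement queue is exactly what the experts review, which fits the post's thesis: the evaluation surface shifts from a held-out test set to a stream of expert-adjudicated edge cases.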