Sweatshop Data Is Over
17 days ago
- #AI Progress
- #Data Quality
- #Reinforcement Learning
- High-quality data is crucial for AI progress, but the way it is collected needs to change.
- Early AI systems relied on 'sweatshop data'—low-skill, repetitive tasks performed by cheap labor.
- Modern AI struggles with complex, long-horizon tasks like managing software projects or debugging systems.
- Training AI for advanced roles (e.g., infrastructure engineer) requires sophisticated RL environments, not just static datasets.
- Current AI coding tools fail at handling complex, real-world software challenges.
- Three key changes are needed: a shift from static datasets to interactive software environments, full-time specialists instead of contractors, and deep integration of domain expertise.
- Subject-matter experts are now the bottleneck for AI progress, requiring their tacit knowledge to be encoded into AI systems.
- Historically, the importance of data was underestimated; training on the right data (e.g., GPT-3's natural-language text) made a decisive difference.
- Pretraining is saturating; GPT-4.5 didn’t feel as revolutionary as GPT-4 over GPT-3.5.
- RLVR (reinforcement learning with verifiable rewards) helps but isn’t enough for open-ended real-world tasks.
- Better RL environments are needed to simulate reality and reward AI for skillful navigation.
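The RLVR idea above can be made concrete with a minimal sketch: an environment where the agent submits candidate code and the reward is computed by running hidden test cases, a programmatic check rather than a human preference label. All names here (`VerifiableCodingEnv`, `solve`) are illustrative, not from any specific library; real environments for long-horizon tasks would involve many steps and richer state.

```python
class VerifiableCodingEnv:
    """Single-step toy environment: prompt in, candidate code out, 0/1 reward."""

    def __init__(self, prompt, test_cases):
        self.prompt = prompt          # task description shown to the agent
        self.test_cases = test_cases  # list of (args, expected_output) pairs

    def reset(self):
        return self.prompt

    def step(self, candidate_source):
        """Execute the candidate and verify it against the hidden tests."""
        namespace = {}
        try:
            exec(candidate_source, namespace)  # define the candidate function
            solve = namespace["solve"]
            passed = all(solve(*args) == expected
                         for args, expected in self.test_cases)
        except Exception:
            passed = False  # any crash or missing definition counts as failure
        reward = 1.0 if passed else 0.0
        return reward, True  # (reward, episode_done)


env = VerifiableCodingEnv(
    prompt="Write solve(xs) returning the sum of squares of xs.",
    test_cases=[(([1, 2, 3],), 14), (([],), 0)],
)
env.reset()
reward, done = env.step("def solve(xs):\n    return sum(x * x for x in xs)")
```

The point of the verifiable reward is that it scales without per-example labeling; the open problem the post highlights is building environments whose checks capture messy real-world success criteria, not just unit tests.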