We accidentally solved robotics by watching 1M hours of YouTube
- #AI
- #machine learning
- #robotics
- The article argues that scaling large language models (LLMs) won't solve robotics, because robots need to understand physics, not just language.
- V-JEPA 2 is introduced as a solution: a world model trained on 1 million hours of YouTube video to predict the next moment in reality, not just the next word.
- The model pairs a ViT-g encoder (roughly 1 billion parameters) that embeds the physical scene with a predictor that fills in masked video segments in representation space; a minimal sketch of this setup appears after this list.
- V-JEPA 2-AC extends this by adding a transformer that predicts the outcome of a candidate action, trained on just 62 hours of raw robot footage (see the second sketch below).
- The model generalizes zero-shot: dropped into new environments with different objects and lighting, it achieves high success rates on tasks like reaching and grasping.
- Planning with V-JEPA 2-AC is significantly faster than with diffusion models (about 16 seconds vs. 4 minutes per action); the third sketch below shows a simplified planning loop.
- The model also performs well on video question answering when aligned with a language model, challenging the notion that language supervision is necessary for understanding the world.
- Limitations include sensitivity to camera positioning, long-horizon planning drift, and the need for visual goals instead of language instructions.
- Future possibilities include world models that rival text models in real-world grounding and robots that understand physics as well as ChatGPT understands language.
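
To make the encoder-plus-predictor idea concrete, here is a minimal, hypothetical sketch of a JEPA-style training step in PyTorch. The module sizes, the zeroing-based masking, and the names (`TinyEncoder`, `predictor`) are all illustrative assumptions, nothing like the actual ViT-g pipeline; what the sketch does show faithfully is that the prediction loss is computed in representation space rather than pixel space.

```python
import torch
import torch.nn as nn

# Minimal, hypothetical sketch of a JEPA-style masked-prediction step.
# Sizes are toy values, nothing like the actual 1B-parameter ViT-g, and
# the real model drops masked tokens rather than zeroing them.

class TinyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                batch_first=True)

    def forward(self, tokens):              # tokens: (B, N, dim) patch embeddings
        return self.block(self.proj(tokens))

encoder = TinyEncoder()                     # online encoder (trained)
target_encoder = TinyEncoder()              # frozen/EMA copy that makes targets
predictor = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

video_tokens = torch.randn(2, 16, 64)       # fake batch: 2 clips, 16 patch tokens
mask = torch.rand(2, 16) < 0.5              # hide roughly half the tokens

with torch.no_grad():
    targets = target_encoder(video_tokens)             # full-context targets

context = video_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
predicted = predictor(encoder(context))                # fill in masked latents

# The loss lives in representation space, not pixel space:
loss = (predicted[mask] - targets[mask]).abs().mean()
loss.backward()
```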
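
The action-conditioned extension can be sketched the same way. This hypothetical `ActionConditionedPredictor` uses a GRU cell as a stand-in for the paper's transformer, and a 7-dimensional action vector as a stand-in for an end-effector command; the point is the interface: current latent state plus action in, next latent state out, unrolled over a candidate action sequence.

```python
import torch
import torch.nn as nn

# Hypothetical action-conditioned dynamics model in latent space.
# A GRU cell stands in for the paper's transformer predictor.
class ActionConditionedPredictor(nn.Module):
    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, latent_dim)
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)

    def forward(self, z, action):
        # z: (B, latent_dim) current latent state; action: (B, action_dim)
        return self.dynamics(self.action_proj(action), z)

def rollout(model, z0, actions):
    """Unroll predicted latent states for a sequence of actions."""
    z, states = z0, []
    for a in actions:                        # each a: (B, action_dim)
        z = model(z, a)
        states.append(z)
    return states

# Usage: predict 5 steps ahead for a batch of 2 latent states.
model = ActionConditionedPredictor()
z0 = torch.randn(2, 64)
actions = [torch.randn(2, 7) for _ in range(5)]
future = rollout(model, z0, actions)         # list of 5 (2, 64) tensors
```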
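
Planning then becomes search in latent space. The paper reportedly optimizes action sequences with the cross-entropy method, minimizing distance to the goal image's representation; the simpler random-shooting sampler below (made-up names, reusing the sketches above) illustrates the same loop: encode the current and goal frames, roll out sampled action sequences, and execute the first action of the best one.

```python
import torch

def plan_action(encoder, model, current_tokens, goal_tokens,
                horizon=5, num_samples=256, action_dim=7):
    """Random-shooting planner over the latent dynamics model.
    encoder/model are the hypothetical modules from the sketches above;
    current_tokens and goal_tokens are patch embeddings of the current
    camera frame and the goal image."""
    with torch.no_grad():
        z0 = encoder(current_tokens).mean(dim=1)       # (1, latent_dim)
        z_goal = encoder(goal_tokens).mean(dim=1)      # (1, latent_dim)

        # Sample candidate action sequences and roll them all out.
        candidates = torch.randn(num_samples, horizon, action_dim)
        z = z0.expand(num_samples, -1)
        for t in range(horizon):
            z = model(z, candidates[:, t])

        # Score by L1 distance to the goal representation (the paper
        # optimizes a similar objective with CEM instead of sampling).
        cost = (z - z_goal).abs().sum(dim=-1)
        best = cost.argmin()

    # Execute only the first action, then re-plan (receding horizon).
    return candidates[best, 0]
```

Re-planning after each executed action keeps errors from compounding over short horizons, which is consistent with the long-horizon drift limitation noted above.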