We accidentally solved robotics by watching 1M hours of YouTube
- #AI
- #machine learning
- #robotics
- The article argues that scaling large language models (LLMs) won't solve robotics, because robots need to understand physics, not just language.
- V-JEPA 2 is introduced as a solution: a world model trained on 1 million hours of YouTube video to predict the next moment in reality, not just the next word.
- The model pairs a ViT-g encoder (roughly 1 billion parameters) that embeds the physical scene with a predictor that fills in masked video segments in representation space; a minimal sketch of this setup appears after this list.
- V-JEPA 2-AC extends this by adding a transformer that predicts the outcome of a candidate action, trained on just 62 hours of raw robot footage (see the second sketch below).
- The model generalizes zero-shot: dropped into new environments with different objects and lighting, it achieves high success rates on tasks like reaching and grasping.
- Planning with V-JEPA 2-AC is significantly faster than with diffusion models (about 16 seconds vs. 4 minutes per action); the third sketch below shows a simplified planning loop.
- The model also performs well on video question answering when aligned with a language model, challenging the notion that language supervision is necessary for understanding the world.
- Limitations include sensitivity to camera positioning, long-horizon planning drift, and the need for visual goals instead of language instructions.
- Future possibilities include world models that rival text models in real-world grounding and robots that understand physics as well as ChatGPT understands language.
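
To make the encoder-plus-predictor idea concrete, here is a minimal, hypothetical sketch of a JEPA-style training step in PyTorch. The module sizes, the zeroing-based masking, and the names (`TinyEncoder`, `predictor`) are all illustrative assumptions, nothing like the actual ViT-g pipeline; what the sketch does show faithfully is that the prediction loss is computed in representation space rather than pixel space.

```python
import torch
import torch.nn as nn

# Minimal, hypothetical sketch of a JEPA-style masked-prediction step.
# Sizes are toy values, nothing like the actual 1B-parameter ViT-g, and
# the real model drops masked tokens rather than zeroing them.

class TinyEncoder(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.block = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                batch_first=True)

    def forward(self, tokens):              # tokens: (B, N, dim) patch embeddings
        return self.block(self.proj(tokens))

encoder = TinyEncoder()                     # online encoder (trained)
target_encoder = TinyEncoder()              # frozen/EMA copy that makes targets
predictor = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)

video_tokens = torch.randn(2, 16, 64)       # fake batch: 2 clips, 16 patch tokens
mask = torch.rand(2, 16) < 0.5              # hide roughly half the tokens

with torch.no_grad():
    targets = target_encoder(video_tokens)             # full-context targets

context = video_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
predicted = predictor(encoder(context))                # fill in masked latents

# The loss lives in representation space, not pixel space:
loss = (predicted[mask] - targets[mask]).abs().mean()
loss.backward()
```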
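
The action-conditioned extension can be sketched the same way. This hypothetical `ActionConditionedPredictor` uses a GRU cell as a stand-in for the paper's transformer, and a 7-dimensional action vector as a stand-in for an end-effector command; the point is the interface: current latent state plus action in, next latent state out, unrolled over a candidate action sequence.

```python
import torch
import torch.nn as nn

# Hypothetical action-conditioned dynamics model in latent space.
# A GRU cell stands in for the paper's transformer predictor.
class ActionConditionedPredictor(nn.Module):
    def __init__(self, latent_dim=64, action_dim=7):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, latent_dim)
        self.dynamics = nn.GRUCell(latent_dim, latent_dim)

    def forward(self, z, action):
        # z: (B, latent_dim) current latent state; action: (B, action_dim)
        return self.dynamics(self.action_proj(action), z)

def rollout(model, z0, actions):
    """Unroll predicted latent states for a sequence of actions."""
    z, states = z0, []
    for a in actions:                        # each a: (B, action_dim)
        z = model(z, a)
        states.append(z)
    return states

# Usage: predict 5 steps ahead for a batch of 2 latent states.
model = ActionConditionedPredictor()
z0 = torch.randn(2, 64)
actions = [torch.randn(2, 7) for _ in range(5)]
future = rollout(model, z0, actions)         # list of 5 (2, 64) tensors
```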
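
Planning then becomes search in latent space. The paper reportedly optimizes action sequences with the cross-entropy method, minimizing distance to the goal image's representation; the simpler random-shooting sampler below (made-up names, reusing the sketches above) illustrates the same loop: encode the current and goal frames, roll out sampled action sequences, and execute the first action of the best one.

```python
import torch

def plan_action(encoder, model, current_tokens, goal_tokens,
                horizon=5, num_samples=256, action_dim=7):
    """Random-shooting planner over the latent dynamics model.
    encoder/model are the hypothetical modules from the sketches above;
    current_tokens and goal_tokens are patch embeddings of the current
    camera frame and the goal image."""
    with torch.no_grad():
        z0 = encoder(current_tokens).mean(dim=1)       # (1, latent_dim)
        z_goal = encoder(goal_tokens).mean(dim=1)      # (1, latent_dim)

        # Sample candidate action sequences and roll them all out.
        candidates = torch.randn(num_samples, horizon, action_dim)
        z = z0.expand(num_samples, -1)
        for t in range(horizon):
            z = model(z, candidates[:, t])

        # Score by L1 distance to the goal representation (the paper
        # optimizes a similar objective with CEM instead of sampling).
        cost = (z - z_goal).abs().sum(dim=-1)
        best = cost.argmin()

    # Execute only the first action, then re-plan (receding horizon).
    return candidates[best, 0]
```

Re-planning after each executed action keeps errors from compounding over short horizons, which is consistent with the long-horizon drift limitation noted above.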