Hasty Briefs (beta)

We accidentally solved robotics by watching 1M hours of YouTube

10 months ago
  • #AI
  • #machine learning
  • #robotics
  • The article argues that scaling large language models (LLMs) does not solve robotics, because robots must understand physics, not just language.
  • V-JEPA 2 is introduced as a solution, trained on 1 million hours of YouTube videos to predict the next moment in reality, not just the next word.
  • The model uses a ViT-g encoder with 1 billion parameters to understand physical situations and a predictor to fill in masked video segments.
  • V-JEPA 2-AC extends this by adding a transformer to predict outcomes of actions, trained on just 62 hours of raw robot footage.
  • The model demonstrates zero-shot generalization, working in new environments with different objects and lighting, achieving high success rates in tasks like reaching and grasping.
  • Planning with V-JEPA 2-AC is roughly 15× faster than with diffusion-model baselines (about 16 seconds versus 4 minutes per action).
  • The model also performs well on video question answering when aligned with a language model, challenging the notion that language supervision is necessary for understanding the world.
  • Limitations include sensitivity to camera positioning, long-horizon planning drift, and the need for visual goals instead of language instructions.
  • Future possibilities include world models that rival text models in real-world grounding and robots that understand physics as well as ChatGPT understands language.
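The planning loop described above (encode the current frame and a visual goal, then search for actions whose predicted future latent lands near the goal) can be sketched with random shooting. Everything here is a toy stand-in: `encode` and `predict` are hypothetical linear placeholders for the ViT-g encoder and the action-conditioned transformer predictor, and the cost/search details are assumptions, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(obs):
    # Placeholder for the ViT-g video encoder: observation -> latent state.
    return np.tanh(obs)

def predict(latent, action):
    # Placeholder for the action-conditioned predictor (V-JEPA 2-AC):
    # rolls the latent state forward under a candidate action.
    return latent + 0.5 * action

def plan_action(current_obs, goal_obs, n_candidates=256, action_dim=4):
    """Pick an action by random shooting: sample candidate actions,
    roll each forward with the predictor, and choose the one whose
    predicted latent is closest to the encoded visual goal."""
    z = encode(current_obs)
    z_goal = encode(goal_obs)
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, action_dim))
    predicted = np.array([predict(z, a) for a in candidates])
    costs = np.linalg.norm(predicted - z_goal, axis=1)
    return candidates[np.argmin(costs)]
```

Because the search happens entirely in latent space, each planning step is a batch of cheap predictor calls rather than a full generative rollout, which is the intuition behind the speed gap versus diffusion-based planners.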