How Google built its Gemini Robotics models

a year ago

Google DeepMind developed a new family of Gemini Robotics models, specifically designed for robots.
The models are multimodal, building upon Gemini 2.0 and fine-tuned with robot-specific data to enable physical actions alongside text, video, and audio outputs.
A bi-arm ALOHA robot successfully performed novel tasks like placing pens inside a shoe and executing a slam dunk with a toy basketball, demonstrating the model's adaptability.
Gemini Robotics models are highly dexterous, interactive, and general, allowing robots to react to new objects, environments, and instructions without additional training.
Two main functions are essential for robots: understanding and decision-making (handled by Gemini Robotics-ER) and taking action (handled by Gemini Robotics).
Gemini Robotics-ER excels in embodied reasoning, detecting objects, and generating code for actions, while Gemini Robotics advances dexterity and multi-step task completion.
The models adapt to various robot embodiments, from academic robots like ALOHA to humanoid robots like Apollo, enabling diverse applications.
Potential future applications include complex industrial settings and human-centric spaces like homes, though widespread adoption is still years away.
