Real-time action chunking with large models
- #VLAs
- #real-time AI
- #robotics
- Unlike chatbots or image generators, robots must operate in real time: any delay between observing the world and acting on it degrades performance.
- Vision-Language-Action Models (VLAs) show promise but are slow due to their large size and reliance on heavy-duty GPUs.
- Action chunking (executing multiple actions per inference call) helps but can cause discontinuities between chunks.
- Initial models (π0, π0-FAST, π0.5) used synchronous execution, leading to pauses between chunks that degrade performance.
- Real-Time Chunking (RTC) was developed to enable seamless, real-time execution without discontinuities.
- RTC treats chunk transitions as an inpainting problem, ensuring consistency between overlapping actions.
- Diffusion and flow models naturally excel at inpainting, making RTC effective without training modifications.
- Experiments showed RTC improves both speed and precision, and it remains robust to inference delays of up to 300 ms.
- RTC maintains performance even with artificially increased latency, unlike synchronous methods.
- Future robot systems will need multi-level, real-time inference for complex tasks as models scale up.
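The cost of synchronous action chunking described above can be sketched in a toy control loop. This is an illustrative model, not Physical Intelligence's code: each inference call produces a chunk of `horizon` actions, and while the model runs for `latency_steps` control ticks, the robot has nothing to execute and must pause (modeled here as zero actions). All function and parameter names are assumptions for the sketch.

```python
def run_synchronous(policy, obs_fn, horizon, total_steps, latency_steps):
    """Toy model of synchronous action chunking.

    Each inference call yields `horizon` actions; during the
    `latency_steps` the model takes to run, the robot pauses
    (represented by 0.0 actions). Names are illustrative only.
    """
    executed = []
    t = 0
    while len(executed) < total_steps:
        # Robot is idle while the (slow) model produces the next chunk.
        executed.extend([0.0] * latency_steps)
        chunk = policy(obs_fn(t))          # one inference call
        executed.extend(chunk[:horizon])   # execute the whole chunk blindly
        t += horizon
    return executed[:total_steps]
```

Running this with a 3-step chunk and a 2-tick latency makes the pauses between chunks visible as stretches of zeros in the executed trajectory, which is exactly the discontinuity RTC is designed to remove.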
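The inpainting view of chunk transitions can also be sketched. In this hedged toy version (not the paper's implementation), a flow model integrates noise into the next action chunk, and the first `overlap` actions, which the robot will already be executing by the time inference finishes, are pinned to the tail of the previous chunk at every integration step. The model only regenerates the non-overlapping suffix, so consecutive chunks stay continuous. `velocity_fn`, `prev_tail`, and the Euler integrator are all assumptions made for illustration.

```python
import numpy as np

def rtc_next_chunk(velocity_fn, prev_tail, horizon, overlap, n_steps=10, seed=0):
    """Toy sketch of real-time chunking as inpainting.

    While a flow model denoises the next chunk, the first `overlap`
    actions are frozen to the previous chunk's tail (the inpainting
    constraint); only the suffix is generated freely.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=horizon)           # start the chunk from noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt
        x = x + dt * velocity_fn(x, tau)   # Euler step of the flow ODE
        x[:overlap] = prev_tail[:overlap]  # inpaint: keep overlap consistent
    return x
```

With an idealized velocity field that flows straight toward a target chunk, the output matches the previous chunk on the overlap and the model's own prediction elsewhere, which is the consistency property RTC relies on.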