Real-time action chunking with large models
- #VLAs
- #real-time AI
- #robotics
- Unlike chatbots or image generators, robots must operate in real time: any delay between observing the world and acting on it degrades performance.
- Vision-Language-Action Models (VLAs) show promise but are slow due to their large size and reliance on heavy-duty GPUs.
- Action chunking (executing multiple actions per inference call) helps but can cause discontinuities between chunks.
- Initial models (π0, π0-FAST, π0.5) used synchronous execution, leading to pauses between chunks that degrade performance.
- Real-Time Chunking (RTC) was developed to enable seamless, real-time execution without discontinuities.
- RTC treats chunk transitions as an inpainting problem, ensuring consistency between overlapping actions.
- Diffusion and flow models naturally excel at inpainting, making RTC effective without training modifications.
- Experiments showed RTC improves both speed and precision, and it remains robust to inference delays of up to 300 ms.
- RTC maintains performance even with artificially increased latency, unlike synchronous methods.
- Future robot systems will need multi-level, real-time inference for complex tasks as models scale up.
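The cost of synchronous action chunking described above can be sketched in a toy control loop. This is an illustrative model, not Physical Intelligence's code: each inference call produces a chunk of `horizon` actions, and while the model runs for `latency_steps` control ticks, the robot has nothing to execute and must pause (modeled here as zero actions). All function and parameter names are assumptions for the sketch.

```python
def run_synchronous(policy, obs_fn, horizon, total_steps, latency_steps):
    """Toy model of synchronous action chunking.

    Each inference call yields `horizon` actions; during the
    `latency_steps` the model takes to run, the robot pauses
    (represented by 0.0 actions). Names are illustrative only.
    """
    executed = []
    t = 0
    while len(executed) < total_steps:
        # Robot is idle while the (slow) model produces the next chunk.
        executed.extend([0.0] * latency_steps)
        chunk = policy(obs_fn(t))          # one inference call
        executed.extend(chunk[:horizon])   # execute the whole chunk blindly
        t += horizon
    return executed[:total_steps]
```

Running this with a 3-step chunk and a 2-tick latency makes the pauses between chunks visible as stretches of zeros in the executed trajectory, which is exactly the discontinuity RTC is designed to remove.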
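The inpainting view of chunk transitions can also be sketched. In this hedged toy version (not the paper's implementation), a flow model integrates noise into the next action chunk, and the first `overlap` actions, which the robot will already be executing by the time inference finishes, are pinned to the tail of the previous chunk at every integration step. The model only regenerates the non-overlapping suffix, so consecutive chunks stay continuous. `velocity_fn`, `prev_tail`, and the Euler integrator are all assumptions made for illustration.

```python
import numpy as np

def rtc_next_chunk(velocity_fn, prev_tail, horizon, overlap, n_steps=10, seed=0):
    """Toy sketch of real-time chunking as inpainting.

    While a flow model denoises the next chunk, the first `overlap`
    actions are frozen to the previous chunk's tail (the inpainting
    constraint); only the suffix is generated freely.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=horizon)           # start the chunk from noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        tau = k * dt
        x = x + dt * velocity_fn(x, tau)   # Euler step of the flow ODE
        x[:overlap] = prev_tail[:overlap]  # inpaint: keep overlap consistent
    return x
```

With an idealized velocity field that flows straight toward a target chunk, the output matches the previous chunk on the overlap and the model's own prediction elsewhere, which is the consistency property RTC relies on.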