Hasty Briefs


Real-time action chunking with large models

a year ago
  • #VLAs
  • #real-time AI
  • #robotics
  • Unlike chatbots or image generators, robots must operate in real time: any delay between observations and actions degrades task performance.
  • Vision-Language-Action Models (VLAs) show promise but are slow due to their large size and reliance on heavy-duty GPUs.
  • Action chunking (executing multiple actions per inference call) helps but can cause discontinuities between chunks.
  • Initial models (π0, π0-FAST, π0.5) used synchronous execution: the robot pauses between chunks while the next one is computed, and these pauses degrade performance.
  • Real-Time Chunking (RTC) was developed to enable seamless, real-time execution without discontinuities.
  • RTC treats chunk transitions as an inpainting problem: actions that overlap with the chunk already being executed are kept fixed, ensuring consistency across the boundary.
  • Diffusion and flow models naturally excel at inpainting, making RTC effective without training modifications.
  • Experiments showed RTC improves both speed and precision, and remains robust even with inference delays as high as 300 ms.
  • RTC maintains performance even with artificially increased latency, unlike synchronous methods.
  • Future robot systems will need multi-level, real-time inference for complex tasks as models scale up.
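The action-chunking and synchronous-execution points above can be sketched in a toy control loop. The chunk size, latency, and function names below are illustrative assumptions, not the actual π0 implementation; the key observation is that in synchronous execution the robot sits idle during every inference call.

```python
import time

CHUNK_SIZE = 50    # actions produced per inference call (hypothetical value)
CONTROL_DT = 0.02  # 50 Hz low-level control rate (hypothetical value)

def infer_chunk(observation):
    """Stand-in for a VLA forward pass; real inference may take 100-300 ms."""
    time.sleep(0.1)  # simulated inference latency
    return [f"action_{i}" for i in range(CHUNK_SIZE)]

def synchronous_loop(num_chunks):
    """Execute chunks back-to-back: the robot pauses during each inference call,
    producing the between-chunk discontinuities the post describes."""
    executed = []
    for _ in range(num_chunks):
        chunk = infer_chunk(observation=None)  # robot is idle here
        for action in chunk:
            executed.append(action)  # would be sent to the controller every CONTROL_DT
    return executed

actions = synchronous_loop(num_chunks=2)
```

Each call to `infer_chunk` blocks the loop, so with 100 ms of latency the arm would visibly hitch every `CHUNK_SIZE * CONTROL_DT` seconds.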
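The inpainting idea behind RTC can be sketched with a standard clamp-and-denoise loop: while generating the next chunk, the actions that overlap with what is already being executed are clamped to their committed values at every generative step, so the new chunk is forced to agree with the old one. Everything below (function names, the toy update rule, the sizes) is an illustrative assumption, not the RTC algorithm or API.

```python
import numpy as np

def inpaint_chunk(denoise_step, prev_overlap, horizon, action_dim,
                  num_steps=10, seed=0):
    """Generate a chunk whose first len(prev_overlap) actions are clamped to
    actions already committed from the previous chunk, so the transition is
    continuous. `denoise_step` stands in for one flow/diffusion update."""
    rng = np.random.default_rng(seed)
    d = len(prev_overlap)
    x = rng.standard_normal((horizon, action_dim))  # start from noise
    for t in range(num_steps):
        x[:d] = prev_overlap       # clamp the frozen, already-executing prefix
        x = denoise_step(x, t)     # one generative update over the whole chunk
    x[:d] = prev_overlap           # final clamp
    return x

def toy_denoise(x, t):
    """Toy update that shrinks actions toward zero (stand-in for a learned model)."""
    return 0.9 * x

prev_overlap = np.ones((5, 3))  # 5 committed actions, 3-DoF action space
new_chunk = inpaint_chunk(toy_denoise, prev_overlap, horizon=20, action_dim=3)
```

Because diffusion and flow models support this clamp-at-every-step style of inpainting natively, no retraining is needed; only the sampling loop changes.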