Hasty Briefs (beta)

Segmenting Robot Video into Actionable Subtasks

3 days ago
  • #Robotics
  • #VLMs
  • #Benchmark
  • Introducing WGO-Bench, a benchmark for robotics subtask annotation across 100 video episodes with 743 segments covering 62 task instructions.
  • Over 60 experiments identified the strongest subtask annotation pipeline, which reaches a segmentation F1 of 0.306, a labeling accuracy of 61.0%, and an end-to-end F1 of 0.168.
  • Gemini models, particularly Gemini 3.5 Flash, outperformed the other models tested by 24.5%, making them the strongest choice for this task.
  • A cost-effective method feeds the model contact sheets and costs $2.64 per hour of video (batch pricing), roughly 19x cheaper than human annotation.
  • Key techniques include visual timestamping on contact sheets, strict annotation protocols, and using previous/current/next segment context for labeling.
  • Segmentation is the main bottleneck, especially for short subtasks, while labeling improves with contextual visual input.
  • The pipeline is open-sourced in Refiner for public use, enabling scalable subtask annotation without human intervention.
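
Two of the techniques listed above, visual timestamps on contact-sheet tiles and previous/current/next context for labeling, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Refiner implementation: the names `Segment`, `timestamp_label`, `contact_sheet_tiles`, and `context_windows`, the `MM:SS.s` label format, and the fixed sampling step are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """A candidate subtask segment (hypothetical structure for this sketch)."""
    start_s: float  # segment start, in seconds from the start of the episode
    end_s: float    # segment end, in seconds from the start of the episode


def timestamp_label(seconds: float) -> str:
    """Format a time offset as MM:SS.s, the text drawn onto a tile."""
    # Round to tenths first so that e.g. 59.96 s becomes 01:00.0, not 00:60.0.
    total_tenths = round(seconds * 10)
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes:02d}:{tenths / 10:04.1f}"


def contact_sheet_tiles(duration_s: float, step_s: float) -> List[Tuple[float, str]]:
    """Return (time, label) pairs for frames sampled every step_s seconds.

    Assumption: frames are sampled at a fixed interval starting at t = 0.
    """
    count = int(duration_s // step_s) + 1
    return [(i * step_s, timestamp_label(i * step_s)) for i in range(count)]


Window = Tuple[Optional[Segment], Segment, Optional[Segment]]


def context_windows(segments: List[Segment]) -> List[Window]:
    """Pair each segment with its previous and next neighbours.

    The first segment has no previous neighbour and the last has no next
    neighbour; those slots are None.
    """
    windows: List[Window] = []
    for i, current in enumerate(segments):
        previous = segments[i - 1] if i > 0 else None
        following = segments[i + 1] if i + 1 < len(segments) else None
        windows.append((previous, current, following))
    return windows
```

In a pipeline of this shape, each label from `contact_sheet_tiles` would be rendered onto its frame before the sheet is sent to the model, and each tuple from `context_windows` would determine which frames accompany a labeling request for the middle segment.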