Segmenting Robot Video into Actionable Subtasks

25 days ago

Introducing WGO-Bench, a benchmark for robotics subtask annotation across 100 video episodes with 743 segments covering 62 task instructions.
Over 60 experiments identified the best subtask annotation pipeline, with top scores of 0.306 segmentation F1, 61.0% labeling accuracy, and 0.168 end-to-end F1.
Gemini models, particularly Gemini 3.5 Flash, outperformed other models by 24.5% and are the best fit for this task.
A cost-effective method uses contact sheets and costs $2.64 per hour of video (batch pricing), roughly 19x cheaper than human annotation.
Key techniques include visual timestamping on contact sheets, strict annotation protocols, and using previous/current/next segment context for labeling.
Segmentation is the main bottleneck, especially for short subtasks, while labeling improves with contextual visual input.
The pipeline is open-sourced in Refiner for public use, enabling scalable subtask annotation without human intervention.
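
Two of the techniques listed above, visual timestamps on contact sheets and previous/current/next context for labeling, can be illustrated with a minimal sketch. This is not the Refiner implementation: the summary does not describe how frames are sampled, how timestamps are rendered, or how context is passed to the model, so the even sampling, the MM:SS.s format, and every name below (Segment, format_timestamp, contact_sheet_times, labeling_context) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Segment:
    """Hypothetical subtask segment, in seconds from the start of the episode."""
    start_s: float
    end_s: float


def format_timestamp(seconds: float) -> str:
    """Render a time as MM:SS.s, the text one might draw onto each tile."""
    # Round to tenths first so values like 59.96 roll over to 01:00.0.
    total_tenths = round(seconds * 10)
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes:02d}:{tenths / 10:04.1f}"


def contact_sheet_times(duration_s: float, n_tiles: int) -> List[float]:
    """Assumed sampling rule: one frame at the center of each equal interval."""
    step = duration_s / n_tiles
    return [step * (i + 0.5) for i in range(n_tiles)]


def labeling_context(
    segments: List[Segment], index: int
) -> Dict[str, Optional[Segment]]:
    """Collect the previous, current, and next segment for one labeling call."""
    prev_seg = segments[index - 1] if index > 0 else None
    next_seg = segments[index + 1] if index + 1 < len(segments) else None
    return {"previous": prev_seg, "current": segments[index], "next": next_seg}


if __name__ == "__main__":
    # Timestamps that would be overlaid on a 4-tile sheet for a 60 s episode.
    for t in contact_sheet_times(60.0, 4):
        print(format_timestamp(t))

    segs = [Segment(0.0, 4.0), Segment(4.0, 9.5), Segment(9.5, 15.0)]
    print(labeling_context(segs, 0))
```

Drawing the timestamp text onto the frames and tiling them into one image would be handled by an imaging library, which is left out here to keep the sketch dependency-free. The first and last segments simply get None for the missing neighbor, which is one possible way to handle episode boundaries.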
