Segmenting Robot Video into Actionable Subtasks
3 days ago
- #Robotics
- #VLMs
- #Benchmark
- Introducing WGO-Bench, a benchmark for robotics subtask annotation comprising 100 video episodes and 743 segments that cover 62 task instructions.
- Across more than 60 experiments, the best subtask annotation pipeline reached a segmentation F1 of 0.306, a labeling accuracy of 61.0%, and an end-to-end F1 of 0.168.
- Gemini models, particularly Gemini 3.5 Flash, performed best on this task, outperforming other models by 24.5%.
- A cost-effective method uses contact sheets and costs $2.64 per hour of video (batch pricing), roughly 19x cheaper than human annotation.
- Key techniques include visual timestamping on contact sheets, strict annotation protocols, and using previous/current/next segment context for labeling.
- Segmentation is the main bottleneck, especially for short subtasks, while labeling improves with contextual visual input.
- The pipeline is open-sourced in Refiner for public use, enabling scalable subtask annotation without human intervention.
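
To make two of the techniques above concrete (visual timestamps on contact sheets, and previous/current/next context for labeling), here is a minimal Python sketch. It is not the Refiner implementation: the names `Segment`, `format_timestamp`, `contact_sheet_plan`, and `context_windows` are hypothetical, the evenly spaced mid-slice sampling is an assumption, and the sketch only plans which timestamp text goes on which tile rather than rendering any images.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """A subtask segment in seconds; hypothetical structure for illustration."""
    start_s: float
    end_s: float
    label: Optional[str] = None


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.s, the text that would be drawn on a tile."""
    total_tenths = round(seconds * 10)  # round first so 59.96 s becomes 01:00.0
    minutes, tenths = divmod(total_tenths, 600)
    return f"{minutes:02d}:{tenths / 10:04.1f}"


def contact_sheet_plan(duration_s: float, rows: int, cols: int) -> List[Tuple[int, int, str]]:
    """Plan a rows x cols contact sheet: (row, col, timestamp label) per tile.

    Assumption: frames are sampled at the midpoint of equal time slices.
    """
    n_tiles = rows * cols
    step = duration_s / n_tiles
    plan = []
    for i in range(n_tiles):
        t = (i + 0.5) * step  # midpoint of the i-th slice
        plan.append((i // cols, i % cols, format_timestamp(t)))
    return plan


def context_windows(
    segments: List[Segment],
) -> List[Tuple[Optional[Segment], Segment, Optional[Segment]]]:
    """Pair each segment with its previous and next neighbours (None at the edges)."""
    windows = []
    for i, current in enumerate(segments):
        previous = segments[i - 1] if i > 0 else None
        following = segments[i + 1] if i + 1 < len(segments) else None
        windows.append((previous, current, following))
    return windows
```

In a setup like this, the labels from `contact_sheet_plan` would be burned into each tile so the model can read times directly off the image, and each triple from `context_windows` would be shown to the model when labeling the middle segment.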