The First Fully General Computer Action Model
a day ago
- #AI
- #Machine Learning
- #Computer Vision
- FDM-1 is a foundation model for computer use, trained on 11 million hours of screen recordings.
- An inverse dynamics model (IDM) labels the recordings with the actions that produced them, such as key presses and mouse movements.
- The video encoder compresses nearly 2 hours of 30 FPS video into 1M tokens, 50x more efficient than previous methods.
- FDM-1 can handle long-context tasks in areas like CAD, finance, and engineering, and its performance improves with scale.
- Training involves three stages: IDM training, labeling the video corpus, and autoregressive training of a forward dynamics model (FDM).
- The video encoder is trained with a masked compression objective, which achieves high compression while preserving semantic detail.
- Evaluation infrastructure includes 80,000 forking virtual machines for scalable testing.
- FDM-1 shows strong performance in tasks like object segmentation, 3D manipulation, and self-driving tests.
- The approach moves computer action modeling from a data-constrained regime to a compute-constrained one.
- Future work aims to solve the technical challenges of building aligned general learners.
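
The video encoder figures in the list above (nearly 2 hours of 30 FPS video in 1M tokens, 50x more efficient than previous methods) imply a per-frame token budget that is easy to work out. The sketch below is only back-of-the-envelope arithmetic. The function name `token_budget` is hypothetical, "nearly 2 hours" is rounded up to exactly 2 hours, and reading "50x" as 50 times fewer tokens per frame is an assumption, not something the summary states.

```python
def token_budget(hours=2.0, fps=30, budget=1_000_000, claimed_gain=50):
    """Rough token arithmetic from the summary's figures (all inputs are assumptions)."""
    frames = int(hours * 3600 * fps)          # total frames in the clip
    per_frame = budget / frames               # average tokens spent per frame
    return {
        "frames": frames,
        "tokens_per_frame": per_frame,
        "tokens_per_second": budget / (hours * 3600),
        # Assumed reading of "50x": a baseline would need 50x more tokens per frame.
        "implied_baseline_tokens_per_frame": per_frame * claimed_gain,
    }


stats = token_budget()
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions, 2 hours at 30 FPS is 216,000 frames, so a 1M-token context leaves about 4.6 tokens per frame on average. Because the clip is slightly shorter than 2 hours, the true per-frame figure would be somewhat higher.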
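
The three training stages listed above (IDM training, labeling the video corpus, autoregressive FDM training) can be sketched with a toy example. This is not the FDM-1 implementation. Here "frames" are integer cursor positions, both models are simple count tables, and every name (`train_idm`, `label_corpus`, `train_fdm`) is hypothetical. The toy forward model also conditions only on the previous action, whereas a real one would also see the frames.

```python
from collections import Counter, defaultdict


def train_idm(labeled):
    """Stage 1: learn which action explains a (before, after) frame pair."""
    votes = defaultdict(Counter)
    for before, after, action in labeled:
        votes[after - before][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def label_corpus(idm, videos):
    """Stage 2: pseudo-label every consecutive frame pair in unlabeled video."""
    return [
        [(before, idm.get(after - before, "unknown"))
         for before, after in zip(video, video[1:])]
        for video in videos
    ]


def train_fdm(labeled_videos):
    """Stage 3: fit a next-action predictor on the labeled sequences.

    A bigram count table stands in for autoregressive training.
    """
    counts = defaultdict(Counter)
    for episode in labeled_videos:
        actions = [action for _, action in episode]
        for prev, nxt in zip(actions, actions[1:]):
            counts[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in counts.items()}


# Small hand-labeled set: (frame_before, frame_after, action).
labeled = [(0, 1, "right"), (5, 4, "left"), (3, 3, "wait"), (2, 3, "right")]
# Larger unlabeled "video" corpus: sequences of cursor positions.
videos = [[0, 1, 2, 3, 3, 2], [7, 6, 5, 5, 6]]

idm = train_idm(labeled)
corpus = label_corpus(idm, videos)
fdm = train_fdm(corpus)
print(idm)
print([action for _, action in corpus[0]])
print(fdm["right"])
```

The design point the toy preserves is that only stage 1 needs ground-truth actions. Once the IDM exists, the amount of training data for the forward model is limited by how much video can be labeled, not by how much was hand-annotated.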