The First Fully General Computer Action Model

a day ago

FDM-1 is a foundation model for computer use, trained on 11 million hours of screen recordings.
An inverse dynamics model (IDM) labels those recordings with the actions that produced them, such as key presses and mouse movements.
The video encoder compresses nearly 2 hours of 30 FPS video into 1M tokens, 50x more efficient than previous methods.
FDM-1 can handle long-context tasks in domains such as CAD, finance, and engineering, and its performance improves with scale.
Training involves three stages: IDM training, labeling the video corpus, and autoregressive training of a forward dynamics model (FDM).
The video encoder is trained with a masked compression objective, which aims for a high compression ratio while preserving semantic detail.
Evaluation infrastructure includes 80,000 forking virtual machines for scalable testing.
FDM-1 shows strong performance in tasks like object segmentation, 3D manipulation, and self-driving tests.
The model moves computer-action modeling from a data-constrained regime to a compute-constrained one.
Future work aims to solve technical challenges for aligned general learners.
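
The encoder figures above (nearly 2 hours of 30 FPS video in 1M tokens) imply a very small per-frame token budget, which a quick back-of-envelope calculation makes concrete. The sketch below is only arithmetic on the numbers quoted in the summary; the helper names are hypothetical, "nearly 2 hours" is rounded to exactly 2, and reading "50x more efficient" as "50x fewer tokens per second of video" is an assumption.

```python
# Back-of-envelope arithmetic on the figures quoted in the summary.
# Assumption: "nearly 2 hours" is rounded to exactly 2 hours.
FPS = 30
TOKEN_BUDGET = 1_000_000
HOURS = 2


def frames_in(hours, fps=FPS):
    """Number of video frames in the given duration."""
    return int(hours * 3600 * fps)


def tokens_per_frame(budget, hours, fps=FPS):
    """Average token budget available for each frame."""
    return budget / frames_in(hours, fps)


def baseline_minutes(hours, efficiency_gain):
    """Minutes of video a baseline would fit in the same budget.

    Assumption: "Nx more efficient" means N times fewer tokens per
    second of video, so the baseline covers 1/N of the duration.
    """
    return hours * 60 / efficiency_gain


frames = frames_in(HOURS)
per_frame = tokens_per_frame(TOKEN_BUDGET, HOURS)
baseline = baseline_minutes(HOURS, 50)

print(f"frames in {HOURS} h at {FPS} FPS: {frames}")
print(f"average tokens per frame: {per_frame:.2f}")
print(f"baseline coverage under the 50x reading: {baseline:.1f} min")
```

Under these assumptions the encoder spends under five tokens per frame on average, and a method 50x less efficient would fit only a couple of minutes of video into the same context.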

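The three training stages listed above (IDM training, labeling the video corpus, autoregressive training) can be illustrated with a deliberately tiny toy. This is not FDM-1's architecture or code: frames are reduced to a single integer cursor position, the "IDM" is a lookup table learned from a few labeled transitions, and the "FDM" is a bigram counter over actions only, whereas the real model presumably conditions on video as well. All function names and data are hypothetical.

```python
from collections import Counter, defaultdict


def train_idm(labeled):
    """Stage 1: learn which action best explains each frame change."""
    votes = defaultdict(Counter)
    for before, after, action in labeled:
        votes[after - before][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def label_video(idm, frames):
    """Stage 2: infer an action for every consecutive frame pair."""
    return [idm.get(b - a, "noop") for a, b in zip(frames, frames[1:])]


def train_fdm(action_sequences):
    """Stage 3: toy autoregressive model, counting next-action frequencies."""
    counts = defaultdict(Counter)
    for actions in action_sequences:
        for prev, nxt in zip(actions, actions[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(fdm, prev_action):
    """Most frequent action seen after prev_action, or "noop" if unseen."""
    options = fdm.get(prev_action)
    return options.most_common(1)[0][0] if options else "noop"


# A small labeled set: (frame_before, frame_after, action).
labeled = [(0, 1, "right"), (5, 4, "left"), (3, 3, "noop"), (2, 3, "right")]
# A larger "corpus" of unlabeled recordings (cursor positions over time).
unlabeled_videos = [[0, 1, 2, 2, 1], [4, 5, 6, 6, 5]]

idm = train_idm(labeled)
labels = [label_video(idm, video) for video in unlabeled_videos]
fdm = train_fdm(labels)

print(labels[0])
print(predict_next(fdm, "noop"))
```

The point of the structure is that only stage 1 needs labeled data; once the IDM exists, the much larger unlabeled corpus becomes usable training data for stage 3.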