Hasty Briefs (beta)

The First Fully General Computer Action Model

a day ago
  • #AI
  • #Machine Learning
  • #Computer Vision
  • FDM-1 is a foundation model for computer use, trained on 11 million hours of screen recordings.
  • It uses an inverse dynamics model (IDM) for labeling actions like key presses and mouse movements.
  • The video encoder compresses nearly 2 hours of 30 FPS video into 1M tokens, 50x more efficient than previous methods.
  • FDM-1 can handle long-context tasks in areas such as CAD, finance, and engineering, and its performance improves with scale.
  • Training involves three stages: IDM training, labeling the video corpus, and autoregressive training of a forward dynamics model (FDM).
  • The video encoder uses a masked compression objective to achieve high compression while preserving semantic detail.
  • Evaluation infrastructure includes 80,000 forking virtual machines for scalable testing.
  • FDM-1 shows strong performance in tasks like object segmentation, 3D manipulation, and self-driving tests.
  • The model shifts computer-action modeling from a data-constrained regime to a compute-constrained one.
  • Future work aims to solve technical challenges for aligned general learners.
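
To make the video-encoder figures above concrete, here is a back-of-the-envelope calculation of the token budget per frame. It uses only the numbers quoted in the summary (2 hours, 30 FPS, 1M tokens). Reading "50x more efficient" as "a baseline spends 50x as many tokens per frame" is an assumption, not something the summary states.

```python
# Figures quoted in the summary: ~2 hours of 30 FPS video in ~1M tokens.
HOURS = 2
FPS = 30
TOKEN_BUDGET = 1_000_000

frames = HOURS * 3600 * FPS              # total frames in the clip
tokens_per_frame = TOKEN_BUDGET / frames

# Assumption: "50x more efficient" means a baseline needs 50x the tokens per frame.
baseline_tokens_per_frame = 50 * tokens_per_frame
baseline_seconds_in_budget = TOKEN_BUDGET / baseline_tokens_per_frame / FPS

print(f"frames: {frames}")
print(f"tokens per frame: {tokens_per_frame:.2f}")
print(f"baseline video in same budget: {baseline_seconds_in_budget:.0f} s")
```

Since the summary says "nearly" 2 hours, the true per-frame budget would be slightly higher than the value computed here.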
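
The three training stages listed above (IDM training, labeling the video corpus, autoregressive FDM training) can be sketched on a toy problem. Everything here is hypothetical and chosen only for illustration: a "frame" is a single cursor position, the IDM is a lookup table, and the FDM is a bigram count model over actions. The real system presumably conditions on encoded video, which this sketch does not attempt.

```python
from collections import Counter, defaultdict

def train_idm(labeled):
    """Stage 1: learn which action explains an observed change between frames."""
    votes = defaultdict(Counter)
    for before, after, action in labeled:
        votes[after - before][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}

def label_video(idm, frames):
    """Stage 2: infer one action per consecutive frame pair of an unlabeled video."""
    return [idm[b - a] for a, b in zip(frames, frames[1:])]

def train_fdm(action_sequences):
    """Stage 3: fit a next-action predictor autoregressively (toy: bigram counts)."""
    counts = defaultdict(Counter)
    for seq in action_sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(fdm, prev_action):
    """Return the most frequent follower of prev_action in the pseudo-labeled data."""
    return fdm[prev_action].most_common(1)[0][0]

# A small labeled set: (frame_before, frame_after, action).
labeled = [(0, 1, "right"), (5, 4, "left"), (3, 3, "stay"), (2, 3, "right")]
idm = train_idm(labeled)

# A larger unlabeled corpus of cursor-position "videos" (toy data).
unlabeled_videos = [[0, 1, 2, 3, 3, 2], [4, 5, 6, 6, 5, 4]]
pseudo_labels = [label_video(idm, v) for v in unlabeled_videos]
fdm = train_fdm(pseudo_labels)

print(pseudo_labels[0])
print(predict_next(fdm, "right"))
```

The point of the structure is that only stage 1 needs action labels. Stages 2 and 3 then scale with the size of the unlabeled corpus.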