Learning to Model the World with Language

6 months ago

Dynalang is an agent that learns to understand and leverage diverse language to predict future observations, world behavior, and rewards.
It learns a multimodal world model that predicts future text and image representations, and it learns to act from rollouts imagined inside that model.
Dynalang can be pretrained on text or video datasets without actions or rewards, enabling it to benefit from large-scale offline data.
The agent outperforms state-of-the-art RL algorithms and task-specific architectures on tasks ranging from grid worlds to photorealistic home navigation.
Dynalang unifies language understanding with future prediction, allowing it to handle environment descriptions, game rules, and instructions effectively.
It models video and text as a unified sequence, similar to human perception, improving both pretraining and RL performance.
The agent can also generate language grounded in the environment, showcasing capabilities in embodied question answering.
Pretraining Dynalang on general text data enhances downstream task performance, demonstrating the versatility of its architecture.
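
The unified sequence and the imagined rollouts described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes one image feature vector and one text token arrive per timestep, and it replaces the learned latent world model with a small deterministic recurrent network with random weights. All names (step, predict, imagine) and all sizes are hypothetical and chosen only for illustration.

```python
import math
import random

random.seed(0)

# Illustrative sizes only; none of these values come from the paper.
VOCAB, IMG_DIM, TOK_DIM, STATE_DIM, N_ACTIONS = 16, 12, 8, 32, 4


def init(n_out, n_in):
    """Random weight matrix stored as n_out rows of n_in weights."""
    return [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def linear(x, w):
    """Matrix-vector product: one dot product per output row."""
    return [sum(a * b for a, b in zip(x, row)) for row in w]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Hypothetical, untrained parameters standing in for a learned world model.
tok_embed = init(VOCAB, TOK_DIM)
w_in = init(STATE_DIM, IMG_DIM + TOK_DIM + N_ACTIONS + STATE_DIM)
w_img = init(IMG_DIM, STATE_DIM)   # predicts next image features
w_tok = init(VOCAB, STATE_DIM)     # predicts next-token logits
w_rew = init(1, STATE_DIM)         # predicts reward
w_pi = init(N_ACTIONS, STATE_DIM)  # toy policy head


def step(state, frame, token, action):
    """Fuse one (frame, token, action) triple into the recurrent state."""
    one_hot = [1.0 if i == action else 0.0 for i in range(N_ACTIONS)]
    x = list(frame) + tok_embed[token] + one_hot + list(state)
    return [math.tanh(v) for v in linear(x, w_in)]


def predict(state):
    """Predict next image features, next-token logits, and reward."""
    return linear(state, w_img), linear(state, w_tok), linear(state, w_rew)[0]


def imagine(state, horizon):
    """Roll forward on the model's own predictions, with no environment."""
    rewards = []
    for _ in range(horizon):
        frame, tok_logits, reward = predict(state)
        rewards.append(reward)
        action = argmax(linear(state, w_pi))
        state = step(state, frame, argmax(tok_logits), action)
    return state, rewards


# Observe five real timesteps, each pairing one frame with one token.
frames = [[random.gauss(0.0, 1.0) for _ in range(IMG_DIM)] for _ in range(5)]
tokens = [random.randrange(VOCAB) for _ in range(5)]
actions = [random.randrange(N_ACTIONS) for _ in range(5)]

state = [0.0] * STATE_DIM
for f, t, a in zip(frames, tokens, actions):
    state = step(state, f, t, a)

# Then imagine three further steps without touching the environment.
final_state, imagined_rewards = imagine(state, horizon=3)
```

In this sketch the same step function consumes both real observations and the model's own predictions, which is what lets a policy be evaluated on imagined sequences. A pretraining variant on text-only data would, under the same assumptions, feed zero vectors for the frame and action and train only the token prediction head.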
