Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
- #joint embedding predictive architecture
- #representation learning
- #world models
- LeWorldModel (LeWM) is a new Joint Embedding Predictive Architecture (JEPA) designed for stable end-to-end training from pixels.
- Unlike prior JEPA methods, it avoids representation collapse using only two loss terms: a next-embedding prediction loss and a Gaussian regularization for latent embeddings.
- This reduces the number of tunable hyperparameters from six (in existing end-to-end alternatives) to one.
- LeWM has about 15 million parameters, can be trained on a single GPU in hours, and enables planning up to 48 times faster than foundation-model-based world models.
- It shows competitive performance across 2D and 3D control tasks, and its latent space encodes meaningful physical structure.
- A surprise evaluation confirms that the model reliably detects physically implausible events, supporting its robustness.
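To make the two-term objective concrete, here is a minimal NumPy sketch of what such a loss could look like. This is an illustrative assumption, not the official LeWM code: the function name `lewm_loss`, the use of mean squared error for the prediction term, and the specific form of the Gaussian regularizer (penalizing deviation of batch statistics from zero mean and unit variance) are all hypothetical choices consistent with the description above.

```python
import numpy as np

def lewm_loss(pred_next, target_next, z, reg_weight=1.0):
    """Hypothetical two-term JEPA objective (illustrative sketch only).

    pred_next:   predicted next-step embeddings, shape (B, D)
    target_next: target next-step embeddings,    shape (B, D)
    z:           current latent embeddings,      shape (B, D)
    """
    # Term 1: next-embedding prediction loss (mean squared error).
    pred_loss = np.mean((pred_next - target_next) ** 2)

    # Term 2: Gaussian regularization on the latents, pushing each
    # embedding dimension toward zero mean and unit variance across the
    # batch; this is one way to discourage representation collapse.
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    gauss_reg = np.mean(mean ** 2) + np.mean((var - 1.0) ** 2)

    # reg_weight is the single tunable hyperparameter in this sketch.
    return pred_loss + reg_weight * gauss_reg
```

A collapsed batch (all latents identical) incurs a large regularization penalty, while a batch with zero mean and unit variance per dimension incurs none, which is the intended anti-collapse behavior.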