Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
- #joint embedding predictive architecture
- #representation learning
- #world models
- LeWorldModel (LeWM) is a new Joint Embedding Predictive Architecture (JEPA) designed for stable end-to-end training from pixels.
- Unlike prior JEPA methods, it avoids representation collapse using only two loss terms: a next-embedding prediction loss and a Gaussian regularization for latent embeddings.
- This reduces the number of tunable hyperparameters from six (in existing end-to-end alternatives) to one.
- LeWM has about 15 million parameters, can be trained on a single GPU in hours, and enables planning up to 48 times faster than foundation-model-based world models.
- It shows competitive performance across 2D and 3D control tasks, and its latent space encodes meaningful physical structure.
- A surprise evaluation confirms that the model reliably detects physically implausible events, supporting its robustness.
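To make the two-term objective concrete, here is a minimal NumPy sketch of what such a loss could look like. This is an illustrative assumption, not the official LeWM code: the function name `lewm_loss`, the use of mean squared error for the prediction term, and the specific form of the Gaussian regularizer (penalizing deviation of batch statistics from zero mean and unit variance) are all hypothetical choices consistent with the description above.

```python
import numpy as np

def lewm_loss(pred_next, target_next, z, reg_weight=1.0):
    """Hypothetical two-term JEPA objective (illustrative sketch only).

    pred_next:   predicted next-step embeddings, shape (B, D)
    target_next: target next-step embeddings,    shape (B, D)
    z:           current latent embeddings,      shape (B, D)
    """
    # Term 1: next-embedding prediction loss (mean squared error).
    pred_loss = np.mean((pred_next - target_next) ** 2)

    # Term 2: Gaussian regularization on the latents, pushing each
    # embedding dimension toward zero mean and unit variance across the
    # batch; this is one way to discourage representation collapse.
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    gauss_reg = np.mean(mean ** 2) + np.mean((var - 1.0) ** 2)

    # reg_weight is the single tunable hyperparameter in this sketch.
    return pred_loss + reg_weight * gauss_reg
```

A collapsed batch (all latents identical) incurs a large regularization penalty, while a batch with zero mean and unit variance per dimension incurs none, which is the intended anti-collapse behavior.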