The Role of Feature Normalization in I-JEPA
- #ViT-Small
- #dependency management
- #machine learning
- Uses `uv` for dependency management.
- Requires downloading the datasets and NYU-Depth tar files, which together need ~100 GB of storage.
- The default training configuration trains a ~300M-parameter ViT-Small, consuming ~22 GB of VRAM over 116 hours.
- Supports resuming training runs, evaluating IN1k validation performance, visualizing features, and plotting losses.
- `token_ids` in the code are LongTensors with four integers per token: register id, sample id, height id, and width id (see the sketch after this list).
- The model processes batches that pack patches from images of varied resolutions, unlike standard ViT models.
- PyTorch's eval mode changes the model's forward pass (e.g., disabling dropout), so `model.eval()` must be called before evaluation (see the snippet after this list).
- The LiDAR score is computed on a random subset of the training data, so it can change when a run is resumed.
- Supports single-GPU training only, with optional Pillow-SIMD for faster data loading.
- Hidden features include ToMe (token merging), absolute factorized learnable position embeddings (sketched below), and various predictor training options.
- Adding register tokens to the encoder and predictor was found to decrease performance significantly.
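
A minimal sketch of how such four-integer token ids could be assembled when packing patches from differently sized images into one batch. The helper `build_token_ids` and the register-id convention are illustrative assumptions, not the repo's actual API:

```python
import torch

def build_token_ids(sample_id: int, h_patches: int, w_patches: int,
                    n_registers: int = 0) -> torch.LongTensor:
    """Hypothetical helper: one [register_id, sample_id, height_id, width_id]
    row per token, matching the four-integer layout described above."""
    ids = []
    # Register tokens (if any) get a nonzero register id and no grid position.
    for r in range(n_registers):
        ids.append([r + 1, sample_id, 0, 0])
    # Patch tokens carry their grid coordinates.
    for h in range(h_patches):
        for w in range(w_patches):
            ids.append([0, sample_id, h, w])
    return torch.tensor(ids, dtype=torch.long)

# Patches from a 14x14 grid and an 8x12 grid packed into one batch:
token_ids = torch.cat([
    build_token_ids(sample_id=0, h_patches=14, w_patches=14),
    build_token_ids(sample_id=1, h_patches=8, w_patches=12),
])
print(token_ids.shape)  # torch.Size([292, 4])
```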
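
The eval-mode caveat is standard PyTorch behavior: modules like dropout and batch norm act differently in train and eval modes, so skipping `model.eval()` makes evaluation stochastic. A toy demonstration:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.5))
x = torch.randn(1, 16)

model.eval()           # disable dropout; batch norm would use running stats
with torch.no_grad():  # also skip autograd bookkeeping during evaluation
    out_a = model(x)
    out_b = model(x)
assert torch.equal(out_a, out_b)  # deterministic in eval mode

model.train()          # re-enable dropout before resuming training
```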
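
One plausible reading of "absolute factorized learnable position embeddings" is a learnable table per spatial axis, summed per token, so the parameter count grows with H + W rather than H x W. The class below is a sketch under that assumption, reusing the four-column token-id layout; it is not code from the repo:

```python
import torch
import torch.nn as nn

class FactorizedPosEmbed(nn.Module):
    """Illustrative factorized absolute position embedding: a learnable row
    table plus a learnable column table, indexed by each token's grid ids."""
    def __init__(self, dim: int, max_h: int = 64, max_w: int = 64):
        super().__init__()
        self.row = nn.Parameter(torch.zeros(max_h, dim))
        self.col = nn.Parameter(torch.zeros(max_w, dim))
        nn.init.trunc_normal_(self.row, std=0.02)
        nn.init.trunc_normal_(self.col, std=0.02)

    def forward(self, token_ids: torch.LongTensor) -> torch.Tensor:
        # token_ids columns: [register_id, sample_id, height_id, width_id]
        h, w = token_ids[:, 2], token_ids[:, 3]
        return self.row[h] + self.col[w]  # (num_tokens, dim)
```

With the packed `token_ids` from the earlier sketch, `FactorizedPosEmbed(dim=384)(token_ids)` would return one embedding per token regardless of each image's resolution, which is what makes the factorized scheme convenient for variable-resolution batches.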