Optical Context Compression Is Just (Bad) Autoencoding
- #Context Compression
- #Computer Vision
- #Language Models
- DeepSeek-OCR shows high-fidelity text reconstruction from only a few vision tokens, sparking interest in vision-based context compression for language models.
- The study questions two assumptions: that vision-based compression offers unique advantages for text reconstruction, and that it is useful for language modeling.
- Simple alternatives, such as parameter-free mean pooling and a learned hierarchical encoder, match or surpass vision-based methods at reconstruction and outperform them at language modeling (a mean-pooling sketch follows this list).
- Vision-based compression fails to outperform truncation in language modeling, suggesting current excitement may be premature.
- Code and checkpoints are available for further exploration.
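
As a rough illustration of the parameter-free mean-pooling baseline mentioned above, the sketch below averages non-overlapping windows of token embeddings into a shorter sequence of compressed embeddings. The function name, tensor shapes, and the 10x compression ratio are illustrative assumptions, not taken from the paper's released code.

```python
# Minimal sketch of parameter-free mean pooling as a context-compression
# baseline: every window of `ratio` consecutive token embeddings is averaged
# into a single compressed embedding. Illustrative only; not the paper's code.
import torch


def mean_pool_compress(token_embeddings: torch.Tensor, ratio: int) -> torch.Tensor:
    """Compress a (seq_len, dim) embedding sequence by averaging
    non-overlapping windows of `ratio` tokens."""
    seq_len, dim = token_embeddings.shape
    # Zero-pad so the length divides evenly; this only dilutes the last group.
    pad = (-seq_len) % ratio
    if pad:
        token_embeddings = torch.cat(
            [token_embeddings, token_embeddings.new_zeros(pad, dim)], dim=0
        )
    # Reshape into (num_groups, ratio, dim) and average each group.
    return token_embeddings.reshape(-1, ratio, dim).mean(dim=1)


# Example: 1,000 token embeddings compressed 10x into 100 pooled embeddings.
emb = torch.randn(1000, 768)
compressed = mean_pool_compress(emb, ratio=10)
print(compressed.shape)  # torch.Size([100, 768])
```

The appeal of such a baseline is that it has no learned parameters at all, which makes it a natural lower bound to compare against vision-based compressors.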