Should LLMs just treat text content as an image?
7 months ago
- #AI
- #OCR
- #Optical Compression
- DeepSeek's OCR paper suggests converting text into images so that AI models can process it more efficiently, an idea the paper terms 'optical compression'.
- Optical compression leverages the fact that a single image token can carry more information than a text token, potentially letting models fit roughly 10x more text into the same context budget (see the back-of-the-envelope sketch after this list).
- The method is inspired by human memory, where recent memories stay vivid while older ones blur; long-form context could be handled the same way, keeping recent text sharp and storing older text at lower fidelity.
- Text tokens are discrete symbols and less information-dense than image tokens, which are continuous vectors and can pack more information into a single token.
- Processing text as images might also align more closely with human cognition: people read by perceiving text visually rather than by consuming a stream of discrete symbols.
- Despite the potential, the approach hasn't become common practice: implementing it in today's multimodal LLMs still runs into real technical hurdles.
- Training a model on text-as-images would require new approaches, such as rendering the training corpus into images of text (a sketch of that rendering step follows this list) or transferring knowledge from existing text-token models, which complicates the whole pipeline.
- The digitization of books is also incomplete, with only about 30% of written books digitized; much of that material exists only on paper or as page scans, which is exactly where an efficient image-based reading method like optical compression would help.
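To make the compression claim concrete, here is a back-of-the-envelope sketch in Python. The per-page word count, the BPE tokens-per-word rate, and the visual token budget are my own illustrative assumptions, not figures from the DeepSeek paper; plugging in different values changes the ratio.

```python
# Back-of-the-envelope arithmetic for the compression claim, using my own
# illustrative numbers (not figures from the DeepSeek paper).
WORDS_PER_PAGE = 500           # rough density of a printed page of prose
TEXT_TOKENS_PER_WORD = 1.3     # typical BPE rate for English text
VISION_TOKENS_PER_PAGE = 100   # assumed visual token budget for one rendered page

text_tokens = WORDS_PER_PAGE * TEXT_TOKENS_PER_WORD
ratio = text_tokens / VISION_TOKENS_PER_PAGE

print(f"text tokens per page:   {text_tokens:.0f}")     # ~650
print(f"vision tokens per page: {VISION_TOKENS_PER_PAGE}")
print(f"compression ratio:      {ratio:.1f}x")           # ~6.5x with these numbers;
# tighter visual token budgets are what would push this toward the ~10x figure above.
```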
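And a minimal sketch of the preprocessing step a text-as-images pipeline implies: rendering a passage onto a page-sized bitmap that a vision encoder could then consume. The use of Pillow, the page dimensions, the default font, and the crude character-count wrapping are all my own choices for illustration, not anything prescribed by the paper.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 1024, height: int = 1024,
                         margin: int = 32) -> Image.Image:
    """Draw wrapped text onto a blank page-sized bitmap."""
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()           # swap in a real TTF for dense, legible pages
    wrapped = textwrap.fill(text, width=110)  # crude wrapping by character count
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return page

if __name__ == "__main__":
    sample = "Optical compression renders long context as an image. " * 40
    render_text_to_image(sample).save("page.png")  # the bitmap a vision encoder would see
```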