Should LLMs just treat text content as an image?
7 months ago
- #AI
- #OCR
- #Optical Compression
- DeepSeek's OCR paper suggests converting text into images so that AI models can process it more efficiently, an idea the paper terms 'optical compression'.
- Optical compression leverages the fact that a single image token can carry more information than a text token, potentially letting models fit roughly 10x more text into the same context budget (see the back-of-the-envelope sketch after this list).
- The method is inspired by human memory, where recent memories stay vivid while older ones blur; long-form context could be handled the same way, keeping recent text sharp and storing older text at lower fidelity.
- Text tokens are discrete symbols and less information-dense than image tokens, which are continuous vectors and can pack more information into a single token.
- Processing text as images might also align more closely with human cognition: people read by perceiving text visually rather than by consuming a stream of discrete symbols.
- Despite the potential, the approach hasn't become common practice: implementing it in today's multimodal LLMs still runs into real technical hurdles.
- Training a model on text-as-images would require new approaches, such as rendering the training corpus into images of text (a sketch of that rendering step follows this list) or transferring knowledge from existing text-token models, which complicates the whole pipeline.
- The digitization of books is also incomplete, with only about 30% of written books digitized; much of that material exists only on paper or as page scans, which is exactly where an efficient image-based reading method like optical compression would help.
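To make the compression claim concrete, here is a back-of-the-envelope sketch in Python. The per-page word count, the BPE tokens-per-word rate, and the visual token budget are my own illustrative assumptions, not figures from the DeepSeek paper; plugging in different values changes the ratio.

```python
# Back-of-the-envelope arithmetic for the compression claim, using my own
# illustrative numbers (not figures from the DeepSeek paper).
WORDS_PER_PAGE = 500           # rough density of a printed page of prose
TEXT_TOKENS_PER_WORD = 1.3     # typical BPE rate for English text
VISION_TOKENS_PER_PAGE = 100   # assumed visual token budget for one rendered page

text_tokens = WORDS_PER_PAGE * TEXT_TOKENS_PER_WORD
ratio = text_tokens / VISION_TOKENS_PER_PAGE

print(f"text tokens per page:   {text_tokens:.0f}")     # ~650
print(f"vision tokens per page: {VISION_TOKENS_PER_PAGE}")
print(f"compression ratio:      {ratio:.1f}x")           # ~6.5x with these numbers;
# tighter visual token budgets are what would push this toward the ~10x figure above.
```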
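And a minimal sketch of the preprocessing step a text-as-images pipeline implies: rendering a passage onto a page-sized bitmap that a vision encoder could then consume. The use of Pillow, the page dimensions, the default font, and the crude character-count wrapping are all my own choices for illustration, not anything prescribed by the paper.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_image(text: str, width: int = 1024, height: int = 1024,
                         margin: int = 32) -> Image.Image:
    """Draw wrapped text onto a blank page-sized bitmap."""
    page = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(page)
    font = ImageFont.load_default()           # swap in a real TTF for dense, legible pages
    wrapped = textwrap.fill(text, width=110)  # crude wrapping by character count
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return page

if __name__ == "__main__":
    sample = "Optical compression renders long context as an image. " * 40
    render_text_to_image(sample).save("page.png")  # the bitmap a vision encoder would see
```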