Glyph: Scaling Context Windows via Visual-Text Compression
6 months ago
- #long-context
- #compression
- #multimodal
- Glyph is a framework for scaling context length through visual-text compression.
- It renders long textual sequences into images and processes them using vision–language models (VLMs).
- Glyph transforms long-context modeling into a multimodal problem, reducing computational and memory costs.
- It reports competitive performance on LongBench and MRCR while substantially compressing the input and speeding up inference.
- Includes a demo script for comparing Glyph with baseline text models like Qwen3/GLM4.
- Built on GLM-4.1V-9B-Base, with the fine-tuned model available on Hugging Face.
- Supports vLLM acceleration for improved throughput and response speed in long-context scenarios.
- Evaluation scripts are provided for the LongBench, MRCR, and RULER benchmarks.
- Known limitations include sensitivity to rendering parameters and OCR-related challenges.
- A citation entry is provided for use in research.
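
To make the compression idea in the summary above concrete, here is a rough back-of-the-envelope sketch of the token accounting when text is laid out on pages and each page is consumed as an image. It is not Glyph's implementation: the function names `paginate` and `estimate_tokens`, the page size, the patch size, the characters-per-token figure, and the layout density are all hypothetical values chosen for illustration.

```python
import math
import textwrap

# All defaults below are illustrative assumptions, not Glyph's actual settings.


def paginate(text, chars_per_line=100, lines_per_page=60):
    """Wrap text to fixed-width lines and group them into pages.

    Each page stands in for one rendered image.
    """
    lines = textwrap.wrap(text, width=chars_per_line)
    return [lines[i:i + lines_per_page]
            for i in range(0, len(lines), lines_per_page)]


def estimate_tokens(text, page_px=(560, 840), patch_px=28,
                    chars_per_text_token=4.0,
                    chars_per_line=100, lines_per_page=60):
    """Compare a rough text-token count with a rough visual-token count.

    Assumes one visual token per patch_px x patch_px patch of each page image
    and a fixed average number of characters per text token.
    """
    pages = paginate(text, chars_per_line, lines_per_page)
    tokens_per_page = (page_px[0] // patch_px) * (page_px[1] // patch_px)
    text_tokens = math.ceil(len(text) / chars_per_text_token)
    visual_tokens = len(pages) * tokens_per_page
    # Ratio is undefined for empty input (no pages rendered).
    ratio = text_tokens / visual_tokens if visual_tokens else None
    return {
        "pages": len(pages),
        "text_tokens": text_tokens,
        "visual_tokens": visual_tokens,
        "compression_ratio": ratio,
    }


if __name__ == "__main__":
    print(estimate_tokens("word " * 1000))
```

Under these assumptions, packing more characters onto each page raises the estimated ratio, but it also shrinks the glyphs the model has to read, which is one way to see why the approach is sensitive to rendering parameters.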