PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Compact VLM
a day ago
- #vision-language-model
- #document-parsing
- #multilingual-OCR
- PaddleOCR-VL is an ultra-compact, 0.9B-parameter vision-language model (VLM) designed for multilingual document parsing.
- It combines a NaViT-style dynamic-resolution visual encoder with the ERNIE-4.5-0.3B language model for accurate element recognition.
- Supports 109 languages, including Chinese, English, Japanese, Russian, Arabic, Hindi, and Thai.
- Achieves state-of-the-art (SOTA) performance in page-level document parsing and element-level recognition.
- Excels at recognizing complex elements such as text, tables, formulas, and charts.
- Optimized for resource-efficient inference with fast processing speeds.
- Includes a CLI and a Python API for easy integration.
- Supports accelerated inference through a vLLM server.
- Outperforms existing solutions on benchmarks such as OmniDocBench as well as in in-house evaluations.
- Open source, with detailed installation and usage instructions provided.
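
To make the "NaViT-style dynamic resolution" point more concrete, here is a minimal sketch of the general idea. Instead of resizing every page to one fixed square, the image is cut into patches at its native aspect ratio, and the grid is only scaled down when it would exceed a token budget. This is an illustration of the technique, not PaddleOCR-VL's actual preprocessing: the function name `patch_grid`, the patch size of 14 pixels, and the budget of 1024 patches are all assumptions chosen for the example.

```python
import math


def patch_grid(height, width, patch=14, max_patches=1024):
    """Return (rows, cols) of a patch grid that keeps the image's aspect ratio.

    Hypothetical helper: `patch` and `max_patches` are illustrative values,
    not settings taken from PaddleOCR-VL.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if patch <= 0 or max_patches <= 0:
        raise ValueError("patch and max_patches must be positive")

    # Native grid: one patch per `patch` x `patch` block, rounding up at the edges.
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)

    if rows * cols > max_patches:
        # Shrink both sides by the same factor so the aspect ratio is preserved.
        scale = math.sqrt(max_patches / (rows * cols))
        rows = max(1, math.floor(rows * scale))
        cols = max(1, math.floor(cols * scale))
        # Very elongated images can still exceed the budget after clamping
        # the short side to 1, so cap the long side explicitly.
        if rows * cols > max_patches:
            if rows >= cols:
                rows = max_patches // cols
            else:
                cols = max_patches // rows
    return rows, cols


if __name__ == "__main__":
    print(patch_grid(224, 224))    # (16, 16): small square image, kept as-is
    print(patch_grid(140, 420))    # (10, 30): wide strip keeps its 1:3 shape
    print(patch_grid(1400, 700))   # (45, 22): tall page scaled into the budget
```

The number of visual tokens therefore varies with the input: a narrow text strip costs far fewer patches than a full page, which is one plausible reason a dynamic-resolution encoder suits documents with mixed element sizes.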