
PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Compact VLM

a day ago
  • #vision-language-model
  • #document-parsing
  • #multilingual-OCR
  • PaddleOCR-VL is a 0.9B ultra-compact vision-language model (VLM) designed for multilingual document parsing.
  • It integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model for accurate element recognition.
  • Supports 109 languages, including Chinese, English, Japanese, Russian, Arabic, Hindi, and Thai.
  • Achieves state-of-the-art (SOTA) performance in page-level document parsing and element-level recognition.
  • Excels in recognizing complex elements like text, tables, formulas, and charts.
  • Optimized for resource-efficient inference with fast processing speeds.
  • Includes a CLI and a Python API for easy integration and usage.
  • Supports accelerated inference via a vLLM server for improved performance.
  • Outperforms existing solutions on benchmarks such as OmniDocBench and in in-house evaluations.
  • Open-source with detailed installation and usage instructions provided.
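
To make the "NaViT-style dynamic resolution" point above more concrete, here is a minimal sketch of the general idea: instead of resizing every page to a fixed square, the image is cut into a patch grid that follows its own aspect ratio, and the grid is shrunk only when it exceeds a token budget. The function name `dynamic_patch_grid`, the patch size of 14, and the budget of 1024 patches are illustrative assumptions, not values taken from PaddleOCR-VL, and the sketch leaves out the sequence packing that NaViT also performs.

```python
import math


def dynamic_patch_grid(height, width, patch=14, max_patches=1024):
    """Return (rows, cols) of a patch grid that keeps the image's aspect ratio.

    Hypothetical helper: `patch` and `max_patches` are assumed values chosen
    for illustration, not PaddleOCR-VL's actual configuration.
    """
    if height <= 0 or width <= 0 or patch <= 0 or max_patches < 1:
        raise ValueError("dimensions, patch size, and budget must be positive")

    # Patches needed at native resolution (round up so no pixels are dropped).
    rows = math.ceil(height / patch)
    cols = math.ceil(width / patch)
    if rows * cols <= max_patches:
        return rows, cols

    # Over budget: shrink both sides by the same factor to keep the aspect ratio.
    scale = math.sqrt(max_patches / (rows * cols))
    rows_s = max(1, math.floor(rows * scale))
    # Clamp each side so extreme aspect ratios still respect the budget.
    cols_s = max(1, min(math.floor(cols * scale), max_patches // rows_s))
    rows_s = min(rows_s, max_patches // cols_s)
    return rows_s, cols_s


if __name__ == "__main__":
    # A small crop keeps its native grid; a full page is scaled down to fit.
    print(dynamic_patch_grid(280, 140))
    print(dynamic_patch_grid(1000, 700))
```

The design choice here is that a tall, narrow table column and a wide chart each keep their shape, so the number of visual tokens tracks the content rather than a fixed input size.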