OCR'ing 100k pages with open-source VLMs on Modal

12 hours ago
  • #Vision-Language Models
  • #OCR
  • #Self-Hosting
  • The author OCR'd 100,000 pages with open-source VLMs on Modal in under an hour for $223, significantly cheaper than proprietary APIs.
  • Key surprises: self-hosting open-source OCR was cheaper and easier than expected; GPU choice affects completion time; models should be benchmarked on your own workload rather than picked from leaderboards.
  • Models chosen: Chandra for the best fidelity and dots.ocr-1.5 as the cost-effective option.
  • Modal handled the infrastructure; its volumes, secrets, and scaling controls simplified GPU serving.
  • Quantization (e.g., FP8) improved throughput and reduced costs without significant quality loss.
  • Proprietary APIs are more expensive for comparable quality and lack control over cost, model lifecycle, and deployment.
  • Self-hosting offers control over performance, cost, quality, and data privacy, making it viable for production-scale OCR.
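
To put the headline figure in per-page terms, the arithmetic can be sketched as below. This is a minimal illustration, not code from the article: the only inputs taken from the summary are the $223 total and the 100,000-page count, and the helper names `cost_per_page` and `cost_per_thousand` are hypothetical.

```python
def cost_per_page(total_cost_usd: float, pages: int) -> float:
    """Average cost of one page, given the total spend and page count."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return total_cost_usd / pages


def cost_per_thousand(total_cost_usd: float, pages: int) -> float:
    """Same cost expressed per 1,000 pages, a common unit for OCR pricing."""
    return 1000 * cost_per_page(total_cost_usd, pages)


# Figures from the summary: $223 total for 100,000 pages.
per_page = cost_per_page(223.0, 100_000)
per_thousand = cost_per_thousand(223.0, 100_000)
print(f"${per_page:.5f} per page, ${per_thousand:.2f} per 1,000 pages")
# prints "$0.00223 per page, $2.23 per 1,000 pages"
```

Expressing the run as roughly $2.23 per 1,000 pages makes it easier to compare against a provider's per-page or per-1,000-page price for the same workload.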