OCR'ing 100k pages with open-source VLMs on Modal

12 hours ago

OCR'd 100,000 pages with open-source VLMs on Modal in under an hour for $223 (about $2.23 per 1,000 pages), significantly cheaper than proprietary APIs.
Key surprises: self-hosting open-source OCR is cheaper and easier than expected; GPU selection affects completion time; models should be benchmarked on your own workload, not picked from leaderboards.
Models chosen for quality and cost: Chandra (best fidelity) and dots.ocr-1.5 (most cost-effective).
Used Modal for infrastructure, simplifying GPU serving with volumes, secrets, and scaling controls.
Quantization (e.g., FP8) improved throughput and reduced costs without significant quality loss.
Proprietary APIs cost more for comparable quality and give users less control over cost, model lifecycle, and deployment.
Self-hosting offers control over performance, cost, quality, and data privacy, making it viable for production-scale OCR.
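
To make the headline figures easier to compare against other options, here is a minimal sketch of the unit-cost arithmetic. It uses only the two numbers stated above ($223 and 100,000 pages) and treats "under an hour" as an upper bound of 3,600 seconds. The function names and the `price_per_1k_usd` parameter are hypothetical, introduced for illustration; no proprietary API price is assumed here.

```python
def cost_per_page(total_cost_usd: float, pages: int) -> float:
    """Average cost of OCR'ing one page."""
    return total_cost_usd / pages


def min_throughput(pages: int, max_seconds: float) -> float:
    """Lower bound on pages per second if the run finished within max_seconds."""
    return pages / max_seconds


def api_cost(pages: int, price_per_1k_usd: float) -> float:
    """Cost of the same job on a per-1,000-pages priced API.

    price_per_1k_usd is a hypothetical input: plug in a real vendor quote.
    """
    return pages / 1000 * price_per_1k_usd


TOTAL_COST_USD = 223.0   # from the summary
PAGES = 100_000          # from the summary
MAX_SECONDS = 3600       # "under an hour", taken as an upper bound

per_page = cost_per_page(TOTAL_COST_USD, PAGES)
per_thousand = per_page * 1000
pages_per_second = min_throughput(PAGES, MAX_SECONDS)

print(f"${per_page:.5f} per page")                   # $0.00223 per page
print(f"${per_thousand:.2f} per 1,000 pages")        # $2.23 per 1,000 pages
print(f"at least {pages_per_second:.1f} pages/sec")  # at least 27.8 pages/sec
```

Any per-1,000-pages API price above roughly $2.23 would cost more than this run for the same volume; `api_cost` lets you check that against an actual quote.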