OCR'ing 100k pages with open-source VLMs on Modal
14 hours ago
- #Vision-Language Models
- #OCR
- #Self-Hosting
- OCR'd 100,000 pages with open-source VLMs on Modal in under an hour for $223 (about $2.23 per 1,000 pages), significantly cheaper than proprietary APIs.
- Key surprises: self-hosting open-source OCR is cheaper and easier than expected; GPU selection affects completion time; and models should be benchmarked on your own workload rather than picked from leaderboards.
- Models chosen for quality and cost: Chandra (best fidelity) and dots.ocr-1.5 (cost-effective).
- Used Modal for infrastructure, simplifying GPU serving with volumes, secrets, and scaling controls.
- Quantization (e.g., FP8) improved throughput and reduced costs without significant quality loss.
- Proprietary APIs are more expensive for comparable quality and give users little control over cost, model lifecycle, and deployment.
- Self-hosting offers control over performance, cost, quality, and data privacy, making it viable for production-scale OCR.
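
To make the cost claim concrete, here is a minimal sketch of the per-page arithmetic behind the $223 / 100,000-page figure. The function names are illustrative, and the API price passed to `relative_savings` is a hypothetical placeholder, not a quote from any vendor; substitute the rate you would actually pay.

```python
def cost_per_page(total_cost_usd: float, pages: int) -> float:
    """Average cost of one page for a finished batch run."""
    if pages <= 0:
        raise ValueError("pages must be positive")
    return total_cost_usd / pages


def cost_per_thousand(total_cost_usd: float, pages: int) -> float:
    """Same figure scaled to 1,000 pages, the unit most OCR APIs price in."""
    return cost_per_page(total_cost_usd, pages) * 1000


def relative_savings(self_hosted_per_1k: float, api_per_1k: float) -> float:
    """Fraction saved by self-hosting relative to an API price (0.0 to 1.0)."""
    if api_per_1k <= 0:
        raise ValueError("api_per_1k must be positive")
    return 1 - self_hosted_per_1k / api_per_1k


# Figures from the run described above.
TOTAL_COST_USD = 223.0
PAGES = 100_000

per_page = cost_per_page(TOTAL_COST_USD, PAGES)
per_1k = cost_per_thousand(TOTAL_COST_USD, PAGES)
print(f"per page:    ${per_page:.5f}")
print(f"per 1,000:   ${per_1k:.2f}")

# HYPOTHETICAL API price per 1,000 pages, for illustration only.
HYPOTHETICAL_API_PER_1K = 10.0
saved = relative_savings(per_1k, HYPOTHETICAL_API_PER_1K)
print(f"savings vs. hypothetical API: {saved:.1%}")
```

Running the same calculation against your own invoice and page count is a quick way to check whether self-hosting still wins for your workload, since the result depends on the model, GPU type, and quantization chosen.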