Rolling your own serverless OCR in 40 lines of code
3 months ago
- #Serverless
- #OCR
- #Machine Learning
- The author needed to make 'Bayesian Data Analysis' searchable so a statistics agent could reference it.
- Existing OCR tools were either too limited or too expensive for processing thousands of pages.
- DeepSeek's open OCR model was chosen for its good handling of mathematical notation.
- Modal, a serverless compute platform, was used to run the OCR model on cloud GPUs.
- The solution involved deploying a FastAPI server on Modal to process images into markdown text.
- Batched inference was implemented to process multiple pages simultaneously, improving efficiency.
- The OCR output was cleaned to remove unnecessary grounding tags, focusing on the text content.
- The entire process for a 600-page book took about 45 minutes on an A100 GPU, costing around $2.
- The resulting searchable text allows for easy grepping, explanation by AI, and building search indexes.
- The quality of OCR on mathematical content was noted to be surprisingly good.
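The batched-inference point above can be sketched with a plain batching helper. This is a minimal illustration, not the post's actual code: `ocr_batch` is a hypothetical stand-in for the real model call, and the batch size of 4 is arbitrary.

```python
from typing import Iterator, List

def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches of page images."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def ocr_batch(pages: List[str]) -> List[str]:
    """Stand-in for the real batched model call; returns markdown per page."""
    return [f"# page from {p}" for p in pages]

pages = [f"img_{i}.png" for i in range(10)]
markdown = [md for batch in chunked(pages, 4) for md in ocr_batch(batch)]
```

Running pages through the model in batches keeps the GPU busy instead of paying per-call overhead for each page individually.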
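The grounding-tag cleanup could look something like the following. The `<|ref|>…<|/ref|><|det|>[[…]]<|/det|>` tag format shown here is an assumption about the model's output; check the actual markers your model emits before reusing the regexes.

```python
import re

# Assumed grounding format: <|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
# (tag names are illustrative; verify against your model's real output).
DET = re.compile(r"<\|det\|>.*?<\|/det\|>", re.DOTALL)
REF = re.compile(r"<\|/?ref\|>")

def strip_grounding(text: str) -> str:
    """Drop bounding-box spans, keep only the recognized text."""
    return REF.sub("", DET.sub("", text))

raw = "<|ref|>Theorem 1<|/ref|><|det|>[[12, 40, 210, 60]]<|/det|> holds."
print(strip_grounding(raw))  # -> "Theorem 1 holds."
```

The bounding boxes are useful if you need page layout, but for a searchable text dump they are noise.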
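The ~$2 figure checks out as back-of-the-envelope arithmetic. The ~$2.50/hr A100 rate used here is an assumption; serverless GPU pricing varies by provider and over time.

```python
# Rough cost check: 45 minutes of A100 time at an assumed hourly rate.
pages = 600
minutes = 45
rate_per_hour = 2.50  # assumed; not quoted in the post
cost = minutes / 60 * rate_per_hour
per_page = cost / pages
print(f"~${cost:.2f} total, ~{per_page * 100:.2f} cents per page")
```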
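Once the book is plain markdown, a search index is a few lines. A minimal inverted-index sketch (the sample pages are hypothetical, not the book's text):

```python
from collections import defaultdict
from typing import Dict, List, Set

def build_index(pages: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercase token to the set of page numbers containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for token in text.lower().split():
            index[token.strip(".,;:()")].add(page_no)
    return index

# Hypothetical OCR output, one markdown string per page.
pages = ["The posterior density is proportional to the likelihood.",
         "Hierarchical models share a posterior."]
index = build_index(pages)
print(sorted(index["posterior"]))  # -> [1, 2]
```

For one-off lookups, plain `grep` over the markdown files does the same job with no index at all.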