Rolling your own serverless OCR in 40 lines of code
3 months ago
- #Serverless
- #OCR
- #Machine Learning
- The author needed to make 'Bayesian Data Analysis' searchable so a statistics agent could reference it.
- Existing OCR tools were either too limited or too expensive for processing thousands of pages.
- DeepSeek's open OCR model was chosen for its good handling of mathematical notation.
- Modal, a serverless compute platform, was used to run the OCR model on cloud GPUs.
- The solution involved deploying a FastAPI server on Modal to process images into markdown text.
- Batched inference was implemented to process multiple pages simultaneously, improving efficiency.
- The OCR output was cleaned to remove unnecessary grounding tags, focusing on the text content.
- The entire process for a 600-page book took about 45 minutes on an A100 GPU, costing around $2.
- The resulting searchable text allows for easy grepping, explanation by AI, and building search indexes.
- The quality of OCR on mathematical content was noted to be surprisingly good.
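The batched-inference point above can be sketched with a plain batching helper. This is a minimal illustration, not the post's actual code: `ocr_batch` is a hypothetical stand-in for the real model call, and the batch size of 4 is arbitrary.

```python
from typing import Iterator, List

def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches of page images."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def ocr_batch(pages: List[str]) -> List[str]:
    """Stand-in for the real batched model call; returns markdown per page."""
    return [f"# page from {p}" for p in pages]

pages = [f"img_{i}.png" for i in range(10)]
markdown = [md for batch in chunked(pages, 4) for md in ocr_batch(batch)]
```

Running pages through the model in batches keeps the GPU busy instead of paying per-call overhead for each page individually.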
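The grounding-tag cleanup could look something like the following. The `<|ref|>…<|/ref|><|det|>[[…]]<|/det|>` tag format shown here is an assumption about the model's output; check the actual markers your model emits before reusing the regexes.

```python
import re

# Assumed grounding format: <|ref|>text<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>
# (tag names are illustrative; verify against your model's real output).
DET = re.compile(r"<\|det\|>.*?<\|/det\|>", re.DOTALL)
REF = re.compile(r"<\|/?ref\|>")

def strip_grounding(text: str) -> str:
    """Drop bounding-box spans, keep only the recognized text."""
    return REF.sub("", DET.sub("", text))

raw = "<|ref|>Theorem 1<|/ref|><|det|>[[12, 40, 210, 60]]<|/det|> holds."
print(strip_grounding(raw))  # -> "Theorem 1 holds."
```

The bounding boxes are useful if you need page layout, but for a searchable text dump they are noise.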
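The ~$2 figure checks out as back-of-the-envelope arithmetic. The ~$2.50/hr A100 rate used here is an assumption; serverless GPU pricing varies by provider and over time.

```python
# Rough cost check: 45 minutes of A100 time at an assumed hourly rate.
pages = 600
minutes = 45
rate_per_hour = 2.50  # assumed; not quoted in the post
cost = minutes / 60 * rate_per_hour
per_page = cost / pages
print(f"~${cost:.2f} total, ~{per_page * 100:.2f} cents per page")
```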
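Once the book is plain markdown, a search index is a few lines. A minimal inverted-index sketch (the sample pages are hypothetical, not the book's text):

```python
from collections import defaultdict
from typing import Dict, List, Set

def build_index(pages: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercase token to the set of page numbers containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for page_no, text in enumerate(pages, start=1):
        for token in text.lower().split():
            index[token.strip(".,;:()")].add(page_no)
    return index

# Hypothetical OCR output, one markdown string per page.
pages = ["The posterior density is proportional to the likelihood.",
         "Hierarchical models share a posterior."]
index = build_index(pages)
print(sorted(index["posterior"]))  # -> [1, 2]
```

For one-off lookups, plain `grep` over the markdown files does the same job with no index at all.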