We OCR'ed 30k papers using Codex, open OCR models and Jobs

19 hours ago

Researchers can claim papers via Hugging Face accounts, link models, datasets, and spaces, upvote, comment, and tag organizations.
The HuggingChat integration lets users chat with a paper by converting its HTML version to Markdown, but 27,000 papers had no HTML version and required OCR.
Used OlmOCRBench to select Chandra-OCR 2 as the best open OCR model for converting documents to Markdown.
Leveraged Hugging Face Jobs with vLLM on GPUs for scalable processing; chose L40S GPUs for cost and speed efficiency.
Codex wrote the OCR processing scripts, managed 16 parallel jobs, and estimated the total cost at $850, lower than the API-based alternatives.
Switched to writing results to mounted Hugging Face Buckets via hf-mount for faster, git-free storage integration.
Completed OCR of 27,000 papers in about 30 hours, enabling chat functionality for all papers on the hub.
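
The summary does not show the scripts Codex produced, so the following is only a minimal sketch of how a workload like this could be split across 16 parallel jobs that write Markdown files into a mounted bucket directory. The function names, the round-robin sharding, the one-file-per-paper layout, and the mount path are all assumptions made for illustration; the OCR call itself (the model served through vLLM) is left out.

```python
from pathlib import Path

NUM_JOBS = 16  # parallel job count taken from the summary


def shard_for_job(paper_ids, job_index, num_jobs=NUM_JOBS):
    """Return the paper IDs one job handles (round-robin split, an assumption)."""
    if not 0 <= job_index < num_jobs:
        raise ValueError("job_index out of range")
    return paper_ids[job_index::num_jobs]


def output_path(mount_root, paper_id):
    """Hypothetical layout: one Markdown file per paper under the mounted bucket."""
    return Path(mount_root) / f"{paper_id}.md"


def pending(paper_ids, mount_root):
    """Skip papers whose Markdown already exists, so a restarted job can resume."""
    return [pid for pid in paper_ids if not output_path(mount_root, pid).exists()]


def write_markdown(mount_root, paper_id, markdown):
    """Write the OCR result as an ordinary file; no git commit is involved."""
    path = output_path(mount_root, paper_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")
    return path
```

Each job would call `shard_for_job(all_ids, job_index)`, filter the result through `pending(...)`, run OCR on what remains, and pass each result to `write_markdown(...)` with the mount point as `mount_root`. Taking the summary's figures at face value, $850 for 27,000 papers works out to roughly $0.03 per paper.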
