We OCR'ed 30k papers using Codex, open OCR models and Jobs
19 hours ago
- #AI-Research
- #HuggingFace
- #OCR
- Researchers can claim their papers through their Hugging Face accounts; link related models, datasets, and Spaces; and upvote, comment, and tag organizations.
- The HuggingChat integration lets users chat with a paper by converting its HTML version to Markdown, but 27,000 papers had no HTML version and required OCR instead.
- Used olmOCR-Bench to select Chandra-OCR 2 as the best open OCR model for converting documents to Markdown.
- Leveraged Hugging Face Jobs with vLLM on GPUs for scalable processing; chose L40S GPUs for cost and speed efficiency.
- Codex wrote the OCR processing scripts, managed 16 parallel jobs, and estimated the cost at $850, lower than API-based alternatives.
- Switched to writing results directly to Hugging Face Buckets mounted with hf-mount, which is faster and avoids git-based storage.
- Completed OCR of 27,000 papers in about 30 hours, enabling chat functionality for all papers on the hub.
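
The fan-out described above (16 parallel jobs, each writing Markdown into a mounted bucket) can be sketched roughly as follows. This is a minimal illustration, not the scripts Codex actually produced: the strided sharding scheme, the skip-if-exists check, the one-file-per-paper layout, and the names `shard_for_job`, `write_result`, and `fake_ocr` are all assumptions. The sketch relies only on the mount behaving like an ordinary directory, so a temporary directory stands in for it here.

```python
import tempfile
from pathlib import Path

NUM_JOBS = 16  # from the summary: 16 parallel jobs


def shard_for_job(paper_ids, job_index, num_jobs=NUM_JOBS):
    """Return the paper IDs one job should process (strided split, an assumed scheme)."""
    if not 0 <= job_index < num_jobs:
        raise ValueError("job_index must be in [0, num_jobs)")
    return paper_ids[job_index::num_jobs]


def write_result(output_dir, paper_id, markdown):
    """Write one paper's Markdown as <paper_id>.md; return False if it already exists."""
    out_path = Path(output_dir) / f"{paper_id}.md"
    if out_path.exists():
        return False  # already processed, so a restarted job can skip it
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(markdown, encoding="utf-8")
    return True


def fake_ocr(paper_id):
    """Hypothetical placeholder for the real OCR call (e.g. a request to a vLLM server)."""
    return f"# {paper_id}\n\n(placeholder Markdown)\n"


if __name__ == "__main__":
    # Hypothetical IDs; the real run would list the papers that lack HTML.
    paper_ids = [f"paper-{i:05d}" for i in range(27_000)]
    job_index = 3  # in a real job this would come from an argument or environment variable

    # A temporary directory stands in for the mounted bucket path.
    with tempfile.TemporaryDirectory() as mount_dir:
        my_papers = shard_for_job(paper_ids, job_index)
        written = sum(write_result(mount_dir, pid, fake_ocr(pid)) for pid in my_papers[:5])
        print(f"job {job_index}: {len(my_papers)} papers assigned, {written} written in this demo")
```

With a strided split, each of the 16 jobs receives either 1,687 or 1,688 of the 27,000 papers, and no coordination between jobs is needed because the shards are disjoint by construction. The skip-if-exists check is one simple way to make a job restartable, assuming each result is stored as its own file.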