We OCR'ed 30k papers using Codex, open OCR models and Jobs

19 hours ago
  • #AI-Research
  • #HuggingFace
  • #OCR
  • Researchers can claim papers via Hugging Face accounts, link models, datasets, and spaces, upvote, comment, and tag organizations.
  • HuggingChat integration lets users chat with a paper by converting its HTML to Markdown, but 27,000 papers had no HTML version and required OCR.
  • Used OlmOCRBench to select Chandra-OCR 2 as the best open OCR model for converting documents to Markdown.
  • Leveraged Hugging Face Jobs with vLLM on GPUs for scalable processing; chose L40S GPUs for cost and speed efficiency.
  • Codex wrote the OCR processing scripts, managed 16 parallel jobs, and estimated the total cost at $850, lower than API-based alternatives.
  • Switched to writing results directly to Hugging Face Buckets mounted via hf-mount, which made storage faster and removed the need for git.
  • Completed OCR of 27,000 papers in about 30 hours, enabling chat functionality for all papers on the hub.
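
The fan-out described above (16 parallel jobs, each writing Markdown to a mounted directory) could look roughly like the sketch below. It is illustrative only, not the scripts Codex produced: `shard`, `run_ocr`, and `process_shard` are hypothetical names, `run_ocr` is a stub standing in for the real vLLM call, the paper IDs are placeholders, and a temporary directory stands in for the hf-mount path. Only the four constants come from the summary, and the per-paper and per-hour figures are plain arithmetic on them.

```python
from pathlib import Path
import tempfile

NUM_JOBS = 16          # parallel jobs, from the summary
NUM_PAPERS = 27_000    # papers without HTML, from the summary
EST_COST_USD = 850     # estimated total cost, from the summary
WALL_HOURS = 30        # approximate wall-clock time, from the summary


def shard(items, num_shards, index):
    """Return the items assigned to shard `index` (round-robin split)."""
    return items[index::num_shards]


def run_ocr(paper_id):
    """Hypothetical stub for the real OCR call (e.g. a request to a vLLM server)."""
    return f"# Paper {paper_id}\n\n(placeholder Markdown)\n"


def process_shard(paper_ids, out_dir):
    """Write one Markdown file per paper; skip files that already exist."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for pid in paper_ids:
        target = out_dir / f"{pid}.md"
        if target.exists():  # lets a restarted job resume where it stopped
            continue
        target.write_text(run_ocr(pid), encoding="utf-8")
        written += 1
    return written


if __name__ == "__main__":
    paper_ids = [f"paper-{i:05d}" for i in range(NUM_PAPERS)]  # placeholder IDs
    shards = [shard(paper_ids, NUM_JOBS, i) for i in range(NUM_JOBS)]
    print("largest / smallest shard:", len(shards[0]), len(shards[-1]))
    print(f"estimated cost per paper: ${EST_COST_USD / NUM_PAPERS:.4f}")
    print(f"overall throughput: {NUM_PAPERS / WALL_HOURS:.0f} papers per hour")
    # A temporary directory stands in for the mounted bucket path.
    with tempfile.TemporaryDirectory() as mount_dir:
        print("files written:", process_shard(shards[0][:5], mount_dir))
```

Because a mounted bucket behaves like an ordinary directory, each job in this sketch only needs plain file writes, and the existence check makes reruns skip work that is already done.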