Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

a year ago

An OCR system for extracting structured data from educational materials such as exam papers, with output optimized for ML training.
Supports multilingual text, mathematical formulas, tables, diagrams, and charts.
Semantically annotates extracted elements with contextual explanations, including natural language descriptions for visual content.
Works with Japanese, Korean, and English, with customization options for additional languages.
Generates AI-ready outputs in JSON or Markdown, including descriptions of mathematical expressions and figure captions.
Achieves 90–95% accuracy on real-world academic datasets like EJU Biology and UTokyo Math.
Accurately processes complex layouts with dense scientific content, formulas, and visual elements.
Built with technologies like DocLayout-YOLO, Google Vision API, Gemini Pro Vision, MathPix OCR, OpenAI API, and OpenCV.
Includes sample outputs from real-world materials, such as EJU Biology and UTokyo Math, with the semantic context translated into English.
Open-source project released under the MIT License; community contributions and enhancements are encouraged.
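
To make the "JSON or Markdown" output described above more concrete, here is a minimal Python sketch of what an annotated element and its two serializations might look like. The `Element` fields (`kind`, `content`, `description`, `language`) and the helper names `to_json` and `to_markdown` are assumptions made for illustration; they are not taken from the project and its actual schema may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Element:
    """Hypothetical record for one extracted element (not the project's schema)."""
    kind: str            # "text", "formula", "table", or "figure"
    content: str         # raw text, LaTeX source, or an image path
    description: str     # natural-language explanation of the element
    language: str = "en"


def to_json(elements):
    # ensure_ascii=False keeps Japanese and Korean text readable in the output.
    return json.dumps([asdict(e) for e in elements], ensure_ascii=False, indent=2)


def to_markdown(elements):
    parts = []
    for e in elements:
        if e.kind == "formula":
            # Formulas are emitted as display math.
            parts.append(f"$$\n{e.content}\n$$")
        elif e.kind == "figure":
            # For figures, the description doubles as the image alt text.
            parts.append(f"![{e.description}]({e.content})")
        else:
            parts.append(e.content)
        if e.description and e.kind != "figure":
            # Other elements carry their description as a block quote.
            parts.append(f"> {e.description}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    page = [
        Element("text", "問1 次の式を計算せよ。",
                "Question 1: evaluate the following expression.", "ja"),
        Element("formula", r"\int_0^1 x^2\,dx",
                "Definite integral of x squared from 0 to 1."),
    ]
    print(to_json(page))
    print(to_markdown(page))
```

Keeping the description as a separate field, rather than folding it into the content, lets a training pipeline choose whether to include the semantic annotation or only the raw extraction.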
