Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)
a year ago
- #OCR
- #Educational Technology
- #Machine Learning
- OCR system designed for extracting structured data from educational materials like exam papers, optimized for ML training.
- Supports multilingual text, mathematical formulas, tables, diagrams, and charts.
- Semantically annotates extracted elements with contextual explanations, including natural language descriptions for visual content.
- Works with Japanese, Korean, and English, with customization options for additional languages.
- Generates AI-ready outputs in JSON or Markdown, including descriptions of mathematical expressions and figure captions.
- Reports 90–95% accuracy on real-world academic datasets such as EJU Biology and UTokyo Math.
- Accurately processes complex layouts with dense scientific content, formulas, and visual elements.
- Built with DocLayout-YOLO, Google Vision API, Gemini Pro Vision, Mathpix OCR, OpenAI API, and OpenCV.
- Includes examples of outputs from real-world materials, such as EJU Biology and UTokyo Math, with English-translated semantic context.
- Open-source project under the MIT License, open to community-driven enhancements and collaboration.
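
To make the "AI-ready outputs in JSON or Markdown" point more concrete, here is a minimal Python sketch of what a semantically annotated record could look like and how it might be serialized to both formats. The `Element` class, its field names, and the `to_json` / `to_markdown` helpers are all assumptions made for illustration. They are not taken from the project, whose actual schema may differ.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Element:
    """Hypothetical record for one extracted page element (not the project's schema)."""
    kind: str          # "text", "formula", "table", or "figure"
    content: str       # raw OCR text, LaTeX source, or a reference to a table/figure
    language: str      # language code such as "ja", "ko", or "en"
    description: str   # natural-language explanation attached to the element


def to_json(elements):
    # ensure_ascii=False keeps Japanese and Korean text readable in the output.
    return json.dumps([asdict(e) for e in elements], ensure_ascii=False, indent=2)


def to_markdown(elements):
    lines = []
    for e in elements:
        # Formulas are wrapped in display-math delimiters; everything else is emitted as-is.
        lines.append(f"$$ {e.content} $$" if e.kind == "formula" else e.content)
        # The semantic annotation follows each element as a blockquote.
        lines.append(f"> {e.description}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Made-up sample data, not drawn from the EJU or UTokyo materials.
page = [
    Element("text", "次の式を解け。", "ja", "Instruction: solve the following equation."),
    Element("formula", r"x^2 - 5x + 6 = 0", "en", "A quadratic equation in x."),
]

if __name__ == "__main__":
    print(to_json(page))
    print(to_markdown(page))
```

Keeping the description next to the raw content in one record means a downstream training pipeline can use the formula source, the explanation, or both without re-running OCR.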