Show HN: OCR pipeline for ML training (tables, diagrams, math, multilingual)

a year ago
  • #OCR
  • #Educational Technology
  • #Machine Learning
  • OCR system designed for extracting structured data from educational materials like exam papers, optimized for ML training.
  • Supports multilingual text, mathematical formulas, tables, diagrams, and charts.
  • Semantically annotates extracted elements with contextual explanations, including natural language descriptions for visual content.
  • Works with Japanese, Korean, and English, with customization options for additional languages.
  • Generates AI-ready outputs in JSON or Markdown, including descriptions of mathematical expressions and figure captions.
  • Reports 90–95% accuracy on real-world academic datasets such as EJU Biology and UTokyo Math.
  • Accurately processes complex layouts with dense scientific content, formulas, and visual elements.
  • Built with DocLayout-YOLO, Google Vision API, Gemini Pro Vision, Mathpix OCR, OpenAI API, and OpenCV.
  • Includes examples of outputs from real-world materials, such as EJU Biology and UTokyo Math, with English-translated semantic context.
  • Open-source project released under the MIT License, open to community contributions and enhancements.
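
To make the "AI-ready outputs in JSON or Markdown" point concrete, here is a minimal sketch of what an annotated page might look like and how it could be serialized both ways. The `Element` class, its field names, and the `to_json` / `to_markdown` helpers are assumptions made for illustration; the summary does not specify the project's actual schema, so the real output may be organized differently.

```python
import json
from dataclasses import dataclass, asdict


# Hypothetical schema: these field names are illustrative only and are
# not taken from the project's real output format.
@dataclass
class Element:
    kind: str               # "text", "formula", "table", or "figure"
    content: str            # raw text, LaTeX source, or an image path
    description: str = ""   # natural-language explanation of the element
    language: str = "en"    # language code of the original content


def to_json(elements):
    """Serialize elements to JSON, keeping non-ASCII text readable."""
    return json.dumps([asdict(e) for e in elements], ensure_ascii=False, indent=2)


def to_markdown(elements):
    """Render elements as Markdown, attaching descriptions as context."""
    parts = []
    for e in elements:
        if e.kind == "formula":
            body = f"$$\n{e.content}\n$$"
        elif e.kind == "figure":
            # The description doubles as the figure's alt text.
            body = f"![{e.description}]({e.content})"
        else:
            body = e.content
        if e.description and e.kind != "figure":
            body += f"\n\n> {e.description}"
        parts.append(body)
    return "\n\n".join(parts)


# A made-up exam fragment, not real extracted data.
sample_page = [
    Element("text", "問1 次の式を計算せよ。",
            "Question 1: evaluate the following expression.", "ja"),
    Element("formula", r"\int_0^1 x^2 \, dx",
            "Definite integral of x squared from 0 to 1."),
    Element("figure", "fig1.png",
            "Diagram of a cell with labeled organelles."),
]

if __name__ == "__main__":
    print(to_json(sample_page))
    print(to_markdown(sample_page))
```

Keeping the description next to each element, instead of in a separate file, means a training example carries its own context whichever format is chosen.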