OpenDataLoader-PDF: An open-source tool for structured PDF parsing
8 hours ago
- #AI-integration
- #document-layout
- #PDF-processing
- OpenDataLoader-PDF converts PDFs into JSON, Markdown, or HTML for AI stacks like LLMs, vector search, and RAG.
- It reconstructs document layout (headings, lists, tables, reading order) for easier chunking, indexing, and querying.
- Runs locally with high-throughput processing and includes AI-safety features to filter prompt-injection content.
- Features include OCR for scanned PDFs, AI-based table recognition for borderless tables and merged cells, and annotated PDF visualization.
- Performance benchmarks and AI red-teaming results are reported openly, together with the datasets and metrics behind them.
- Requires Java 11+ and Python 3.9+ for installation and operation.
- CLI and API options are available for processing PDFs, with customizable output formats and safety filters.
- Includes detailed documentation for API usage, configuration options, and contribution guidelines.
- Licensed under Mozilla Public License 2.0 with brand usage guidelines for trademarks and logos.
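
To make the chunking point above concrete, here is a minimal sketch of how layout-aware output might be grouped into heading-scoped chunks before indexing. It does not call OpenDataLoader-PDF itself: the `elements` list, its `type` and `text` keys, and the `chunk_by_heading` helper are all hypothetical stand-ins, and the tool's real JSON schema may differ.

```python
from typing import Dict, List

# Hypothetical parsed elements in reading order. The real OpenDataLoader-PDF
# JSON schema may use different keys and nesting; this shape is an assumption.
elements: List[Dict[str, str]] = [
    {"type": "heading", "text": "Introduction"},
    {"type": "paragraph", "text": "PDFs are hard to parse."},
    {"type": "list", "text": "- point one\n- point two"},
    {"type": "heading", "text": "Method"},
    {"type": "paragraph", "text": "We reconstruct reading order."},
]


def chunk_by_heading(items: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Group consecutive elements under the most recent heading."""
    chunks: List[Dict[str, object]] = []
    current = None
    for el in items:
        if el["type"] == "heading":
            # A new heading closes the previous chunk, if any.
            if current is not None:
                chunks.append(current)
            current = {"heading": el["text"], "body": []}
        else:
            # Content before the first heading goes into an untitled chunk.
            if current is None:
                current = {"heading": "", "body": []}
            current["body"].append(el["text"])
    if current is not None:
        chunks.append(current)
    # Join each chunk's body into a single text field ready for embedding.
    return [
        {"heading": c["heading"], "text": "\n".join(c["body"])}
        for c in chunks
    ]


if __name__ == "__main__":
    for chunk in chunk_by_heading(elements):
        print(repr(chunk["heading"]), "->", repr(chunk["text"]))
```

Splitting on headings rather than on fixed character counts keeps each chunk aligned with a section of the document, which is the main benefit that reconstructed layout offers a retrieval pipeline.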