From PDFs to AI-ready structured data: a deep dive (2024)
- #PDF Processing
- #NLP Workflows
- #Information Extraction
- New modular workflow presented for converting PDFs to structured data using Vision Language Models.
- PDFs are not ideal as a 'source of truth'; extracting data early is crucial for machine learning.
- spaCy and Docling integrated to handle PDF parsing, layout analysis, OCR, and table recognition.
- The spacy-layout plugin converts Docling's output into structured spaCy Doc objects, enabling NLP techniques like named entity recognition.
- Tables are extracted via the TableFormer model and are accessible as pandas DataFrames for processing.
- Prodigy tool used for data annotation, with recipes for manual PDF annotation and training models.
- Layout features should be evaluated for relevance; minimizing them can improve model generalization.
- Benchmarks show Docling runs at 1-3 pages per second on CPU, with plans for GPU support.
- Future research focuses on LLMs for tabular data, layout integration, and efficient annotation workflows.
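
The parsing and table points above can be sketched roughly as follows. This is a minimal illustration, not the post's own code: it assumes the `spacy-layout` package and its documented attributes (`doc.spans["layout"]`, `span.label_`, `span._.heading`, `doc._.tables`, `table._.data`). The helper names `spans_to_records` and `pdf_to_records`, the default label filter, and the stand-in demo spans are hypothetical. The helper deliberately keeps only the section label and the nearest heading and drops bounding boxes, as one way of keeping layout features to a minimum.

```python
from types import SimpleNamespace


def spans_to_records(spans, keep_labels=("text", "list_item")):
    """Flatten layout spans into plain dicts with a small set of layout features.

    The label names in `keep_labels` are assumptions; inspect `span.label_`
    on your own documents to see which labels actually occur.
    """
    records = []
    for span in spans:
        if span.label_ not in keep_labels:
            continue
        heading = span._.heading  # closest heading span, or None
        records.append({
            "text": span.text,
            "label": span.label_,
            "heading": heading.text if heading is not None else None,
        })
    return records


def pdf_to_records(path):
    """Parse a PDF and return (records, tables). Needs `pip install spacy-layout`."""
    # Imported here so the helper above stays usable without the dependency.
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")
    layout = spaCyLayout(nlp)
    doc = layout(path)  # a spaCy Doc with layout spans attached
    tables = [table._.data for table in doc._.tables]  # pandas DataFrames
    return spans_to_records(doc.spans["layout"]), tables


# Stand-in spans so the helper can be tried without a PDF or the library.
demo_heading = SimpleNamespace(text="Results")
demo_spans = [
    SimpleNamespace(text="Results", label_="section_header",
                    _=SimpleNamespace(heading=None)),
    SimpleNamespace(text="Accuracy improved.", label_="text",
                    _=SimpleNamespace(heading=demo_heading)),
]
demo_records = spans_to_records(demo_spans)
print(demo_records)
```

With a real file, `records, tables = pdf_to_records("report.pdf")` (the path is a placeholder) would give plain dictionaries that can be stored or annotated independently of the PDF, plus one DataFrame per detected table.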