From PDFs to AI-ready structured data: a deep dive (2024)

A new modular workflow is presented for converting PDFs to structured data using vision language models.
PDFs are not ideal as a 'source of truth'; extracting data early is crucial for machine learning.
spaCy and Docling are integrated to handle PDF parsing, layout analysis, OCR, and table recognition.
The parsed output is exposed as structured spaCy Doc objects, enabling NLP techniques such as named entity recognition.
Tables are extracted via the TableFormer model and are accessible as pandas DataFrames for further processing.
The Prodigy tool is used for data annotation, with recipes for manually annotating PDFs and for training models.
Layout features should be evaluated for relevance; minimizing them can improve model generalization.
Benchmarks show Docling runs at 1-3 pages per second on CPU, with plans for GPU support.
Future research focuses on LLMs for tabular data, layout integration, and efficient annotation workflows.
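
The parsing steps summarized above can be sketched in Python. This is a minimal sketch, not the article's own code: it assumes the spaCy and Docling integration is the `spacy-layout` package with its `spaCyLayout` class, and the helper names `parse_pdf`, `layout_sections`, and `table_frames` are hypothetical, introduced here only for illustration.

```python
def parse_pdf(pdf_path, lang="en"):
    """Parse a PDF into a spaCy Doc carrying layout spans and tables."""
    # Imported lazily so the helpers below can be used without the
    # third-party packages (assumed install: pip install spacy spacy-layout).
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank(lang)      # blank pipeline: tokenizer only
    layout = spaCyLayout(nlp)    # wraps Docling's PDF conversion
    return layout(pdf_path)


def layout_sections(doc):
    """Return (layout label, text) pairs for every layout span in the Doc."""
    return [(span.label_, span.text) for span in doc.spans["layout"]]


def table_frames(doc):
    """Return the data attached to each detected table (a pandas DataFrame)."""
    return [table._.data for table in doc._.tables]
```

Calling `parse_pdf("report.pdf")` (a hypothetical file name) and passing the result to `layout_sections` and `table_frames` would give the text blocks and tables as plain Python objects, which can then feed a named entity recognizer or any other downstream step.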

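To put the reported CPU throughput in perspective, the following sketch turns the 1-3 pages per second figure into a rough time estimate for a batch of documents. It assumes a single process with no parallelism and a constant per-page cost; the function name `batch_time_range` and the 10,000-page batch are illustrative, not taken from the article.

```python
def batch_time_range(pages, slow_pps=1.0, fast_pps=3.0):
    """Return (best_case_seconds, worst_case_seconds) for parsing `pages` pages.

    slow_pps and fast_pps are the lower and upper throughput bounds in
    pages per second (defaults follow the 1-3 pages/s CPU figure).
    """
    if pages < 0 or slow_pps <= 0 or fast_pps < slow_pps:
        raise ValueError("expected pages >= 0 and 0 < slow_pps <= fast_pps")
    return pages / fast_pps, pages / slow_pps


best, worst = batch_time_range(10_000)
print(f"{best / 60:.0f} to {worst / 60:.0f} minutes")  # prints "56 to 167 minutes"
```

At these rates a 10,000-page corpus takes roughly one to three hours on a single CPU process, which is why extracting the data once, early in the pipeline, is cheaper than re-parsing PDFs on demand.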