What we learned building a multi-agent PDF table extractor
- #Document Processing
- #Table Extraction
- #Multi-Agent AI
- Extraction demos are easy, but production-level extraction must handle unseen documents, human-centric layouts, and high volumes at 99%+ accuracy.
- Real-world challenges include diverse layouts for the same document type, lengthy tables exceeding LLM token limits, and designs optimized for human reading.
- A single autonomous LLM agent is too costly and slow for production; instead, a multi-agent pipeline splits tasks among six specialized agents.
- The pipeline uses preprocessing to convert documents into per-page layout-preserved text and images, enabling page-by-page processing.
- Agent #0 generates and caches prompts for specific table classes, leveraging LLM knowledge for zero-shot generalization.
- Agent #1 acts as a cheap table-presence detector, filtering out irrelevant pages to reduce costs by up to 80%.
- Agent #2 extracts table metadata, recovering complex structures such as headers, subtables, and extra per-row data.
- Agent #3 extracts content with full provenance, using OCR text as the source of truth and images for spatial reference.
- Agent #4 generates Python code to map extracted data to the user's target schema, enabling deterministic transformations and synthetic columns.
- Agent #5 executes the generated code in a sandboxed environment, ensuring security and deterministic output without additional LLM calls.
- Key production-ready features include caching, provenance tracking, and pushing deterministic tasks out of LLMs into code for reliability and cost efficiency.
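
The control flow described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Page`, `prompt_for`, `extract_tables`, `has_table`, `extract_rows`, and `map_row` are all hypothetical, and the LLM-backed agents are replaced by plain functions so the example runs on its own. It shows three of the ideas from the list: caching the per-class prompt (Agent #0), skipping pages without a table before any expensive work (Agent #1), and applying the schema mapping as ordinary deterministic code that carries page provenance (Agents #4 and #5).

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, Iterable


@dataclass
class Page:
    number: int
    text: str  # layout-preserved text produced by preprocessing


@lru_cache(maxsize=None)
def prompt_for(table_class: str) -> str:
    # Stand-in for Agent #0 (assumption): a real system would ask an LLM to
    # write this prompt once per table class; the cache makes reuse free.
    return f"Extract every row of the '{table_class}' table from the page text."


def extract_tables(
    pages: Iterable[Page],
    table_class: str,
    has_table: Callable[[Page], bool],
    extract_rows: Callable[[str, Page], list],
    map_row: Callable[[dict], dict],
) -> list:
    prompt = prompt_for(table_class)
    records = []
    for page in pages:
        # Agent #1: cheap gate, so costly extraction only runs on relevant pages.
        if not has_table(page):
            continue
        # Agents #2-#3 (stubbed): extract rows, then attach page provenance.
        for row in extract_rows(prompt, page):
            row_with_provenance = {**row, "page": page.number}
            # Agents #4-#5 (stubbed): mapping is plain code, no LLM call.
            records.append(map_row(row_with_provenance))
    return records


# Toy stand-ins for the LLM-backed agents (hypothetical, for illustration only).
def has_table(page: Page) -> bool:
    return "Qty" in page.text


def extract_rows(prompt: str, page: Page) -> list:
    body = page.text.splitlines()[1:]  # skip the header line
    return [{"item": line.split()[0], "qty": line.split()[1]} for line in body]


def map_row(row: dict) -> dict:
    # Deterministic transformation into a made-up target schema.
    return {
        "sku": row["item"].upper(),
        "quantity": int(row["qty"]),
        "source_page": row["page"],
    }


pages = [Page(1, "Cover letter"), Page(2, "Item  Qty\nBolt  4\nNut  10")]
records = extract_tables(pages, "line_items", has_table, extract_rows, map_row)
print(records)
```

In this sketch the cover page never reaches `extract_rows`, which is where the cost saving from the detector would come from, and every output record keeps the page it was read from. In a production setting the generated mapping code would run inside a sandbox rather than in-process as it does here.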