What we learned building a multi-agent PDF table extractor

4 hours ago

Extraction demos are easy, but production-level extraction must handle unseen documents, human-centric layouts, and high volumes with 99%+ accuracy.
Real-world challenges include diverse layouts for the same document type, lengthy tables exceeding LLM token limits, and designs optimized for human reading.
A single autonomous LLM agent is too costly and slow for production; instead, a multi-agent pipeline splits tasks among six specialized agents.
The pipeline uses preprocessing to convert documents into per-page layout-preserved text and images, enabling page-by-page processing.
Agent #0 generates and caches prompts for specific table classes, leveraging LLM knowledge for zero-shot generalization.
Agent #1 acts as a cheap table presence detector, filtering out irrelevant pages to reduce costs by up to 80%.
Agent #2 extracts table metadata, recovering complex structures such as headers, subtables, and extra per-row data.
Agent #3 extracts content with full provenance, using OCR text as the source of truth and images for spatial reference.
Agent #4 generates Python code to map extracted data to the user's target schema, enabling deterministic transformations and synthetic columns.
Agent #5 executes the generated code in a sandboxed environment, ensuring security and deterministic output without additional LLM calls.
Key production-ready features include caching, provenance tracking, and pushing deterministic tasks out of LLMs into code for reliability and cost efficiency.
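
As a rough illustration of three of the ideas above (caching a prompt per table class, filtering pages cheaply before any expensive extraction, and doing the schema mapping in plain code), here is a minimal Python sketch. Every name in it is hypothetical and not taken from the original pipeline: `fake_generate_prompt` and `fake_has_table` stand in for LLM calls, `map_row` stands in for code that Agent #4 would generate, and the sandboxing that Agent #5 provides is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Page:
    number: int
    text: str  # layout-preserved text for one page


class PromptCache:
    """Keeps one generated prompt per table class (the Agent #0 idea)."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._cache: Dict[str, str] = {}
        self.misses = 0  # counts how often the generator was actually called

    def get(self, table_class: str) -> str:
        if table_class not in self._cache:
            self.misses += 1
            self._cache[table_class] = self._generate(table_class)
        return self._cache[table_class]


def fake_generate_prompt(table_class: str) -> str:
    # Hypothetical stand-in for an LLM call that writes the extraction prompt.
    return f"Extract every row of the {table_class} table."


def fake_has_table(page: Page) -> bool:
    # Hypothetical stand-in for the cheap detector (Agent #1).
    # A keyword check is an assumption made only for this sketch.
    return "Qty" in page.text


def plan_jobs(
    pages: List[Page],
    table_class: str,
    cache: PromptCache,
    has_table: Callable[[Page], bool],
) -> List[dict]:
    """Return one extraction job per page that appears to contain a table."""
    prompt = cache.get(table_class)
    return [{"page": p.number, "prompt": prompt} for p in pages if has_table(p)]


def map_row(row: Dict[str, str]) -> dict:
    # Stand-in for generated mapping code (Agent #4): renames fields and
    # adds a synthetic "total" column, with no LLM call involved.
    quantity = int(row["Qty"])
    return {
        "sku": row["Item"],
        "quantity": quantity,
        "total": round(quantity * float(row["Price"]), 2),
    }


pages = [
    Page(1, "Cover letter"),
    Page(2, "Item   Qty   Price\nBolt   4     0.10"),
    Page(3, "Terms and conditions"),
]
cache = PromptCache(fake_generate_prompt)
jobs = plan_jobs(pages, "invoice line items", cache, fake_has_table)
jobs_again = plan_jobs(pages, "invoice line items", cache, fake_has_table)
mapped = map_row({"Item": "Bolt", "Qty": "4", "Price": "0.10"})
```

In this sketch only page 2 reaches the extraction stage, the prompt generator runs once even though the pipeline is planned twice, and the mapping gives the same output for the same input on every run.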
