Hasty Briefs (beta)

What we learned building a multi-agent PDF table extractor

4 hours ago
  • #Document Processing
  • #Table Extraction
  • #Multi-Agent AI
  • Extraction demos are easy, but production-grade extraction must handle unseen documents, layouts designed for human readers, and high volumes at 99%+ accuracy.
  • Real-world challenges include diverse layouts for the same document type, lengthy tables exceeding LLM token limits, and designs optimized for human reading.
  • A single autonomous LLM agent is too costly and slow for production; instead, a multi-agent pipeline splits tasks among six specialized agents.
  • The pipeline uses preprocessing to convert documents into per-page layout-preserved text and images, enabling page-by-page processing.
  • Agent #0 generates and caches prompts for specific table classes, leveraging LLM knowledge for zero-shot generalization.
  • Agent #1 acts as a cheap table presence detector, filtering out irrelevant pages to reduce costs by up to 80%.
  • Agent #2 extracts table metadata, recovering complex structures such as headers, subtables, and extra per-row data.
  • Agent #3 extracts content with full provenance, using OCR text as the source of truth and images for spatial reference.
  • Agent #4 generates Python code to map extracted data to the user's target schema, enabling deterministic transformations and synthetic columns.
  • Agent #5 executes the generated code in a sandboxed environment, ensuring security and deterministic output without additional LLM calls.
  • Key production-ready features include caching, provenance tracking, and pushing deterministic tasks out of LLMs into code for reliability and cost efficiency.
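
The prompt caching (Agent #0) and cheap page filtering (Agent #1) described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `PromptCache`, `select_pages`, `fake_generate_prompt`, and `fake_has_table` are hypothetical names, and the two `fake_*` functions are plain-Python stand-ins for what would be LLM calls in a real pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Page:
    number: int
    text: str  # layout-preserved text produced by preprocessing


class PromptCache:
    """Keeps one generated prompt per table class, so generation runs once."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate
        self._cache: Dict[str, str] = {}
        self.misses = 0  # number of times the generator was actually called

    def get(self, table_class: str) -> str:
        if table_class not in self._cache:
            self.misses += 1
            self._cache[table_class] = self._generate(table_class)
        return self._cache[table_class]


def fake_generate_prompt(table_class: str) -> str:
    # Assumption: stands in for an LLM call that writes the extraction prompt.
    return f"Extract every '{table_class}' table on this page."


def fake_has_table(page: Page, prompt: str) -> bool:
    # Assumption: stands in for a cheap detector model; here a keyword check.
    return "Qty" in page.text


def select_pages(
    pages: List[Page],
    table_class: str,
    cache: PromptCache,
    has_table: Callable[[Page, str], bool],
) -> List[Page]:
    """Return only the pages the detector flags; later agents skip the rest."""
    prompt = cache.get(table_class)
    return [p for p in pages if has_table(p, prompt)]


pages = [
    Page(1, "Cover letter"),
    Page(2, "Item   Qty   Price\nBolt   10   0.50"),
    Page(3, "Terms and conditions"),
    Page(4, "Item   Qty   Price\nNut    5    0.10"),
    Page(5, "Signature page"),
]
cache = PromptCache(fake_generate_prompt)
kept = select_pages(pages, "invoice line items", cache, fake_has_table)
kept_again = select_pages(pages, "invoice line items", cache, fake_has_table)
kept_numbers = [p.number for p in kept]
```

In this toy document, three of the five pages never reach the more expensive extraction agents, and the second call reuses the cached prompt instead of regenerating it.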
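
Provenance tracking with the OCR text as the source of truth (Agent #3) might look something like the sketch below. The `Provenance` record and `locate` helper are hypothetical, and the verbatim substring match is an assumption chosen for illustration: a value the model returns is accepted only if it can be found in the page text, and its position is recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    page: int   # page number the value came from
    line: int   # 0-based line index within the page text
    start: int  # character offset where the value starts in that line
    end: int    # character offset just past the value


def locate(value: str, page_number: int, page_text: str) -> Optional[Provenance]:
    """Find the first verbatim occurrence of value in the page text.

    Returns None when the value is absent, which a caller can treat as a
    possible hallucination to reject or flag for review.
    """
    for line_no, line in enumerate(page_text.splitlines()):
        start = line.find(value)
        if start != -1:
            return Provenance(page_number, line_no, start, start + len(value))
    return None


page_text = "Item   Qty   Price\nBolt   10   0.50"
found = locate("0.50", 2, page_text)      # present in the OCR text
missing = locate("0.55", 2, page_text)    # not present: no provenance
```

Because every accepted value carries a page, line, and character span, a reviewer can trace any output cell back to the exact text it came from.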
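
The split between generating mapping code (Agent #4) and executing it without further LLM calls (Agent #5) can be illustrated with the sketch below. Everything here is an assumption for illustration: `GENERATED_CODE` is a hand-written stand-in for model output, `map_row` and `run_mapping` are hypothetical names, and a child Python process in isolated mode with a timeout is only a rough approximation of a sandbox. A production system would need real isolation (containers, restricted system calls, no network access).

```python
import json
import subprocess
import sys

# Assumption: stands in for the code an LLM would generate for the user's schema.
GENERATED_CODE = '''
def map_row(row):
    qty = int(row["Qty"])
    unit_price = float(row["Price"])
    return {
        "sku": row["Item"].strip().upper(),
        "quantity": qty,
        "unit_price": unit_price,
        "line_total": round(qty * unit_price, 2),  # synthetic column
    }
'''

# Fixed wrapper: read rows as JSON from stdin, apply map_row, write JSON to stdout.
RUNNER = '''
import json, sys
rows = json.load(sys.stdin)
json.dump([map_row(r) for r in rows], sys.stdout)
'''


def run_mapping(generated_code: str, rows: list, timeout_s: float = 5.0) -> list:
    """Run the generated mapping in a separate interpreter process.

    -I starts Python in isolated mode (ignores environment variables and the
    user site directory). This limits accidental interference but is NOT a
    security boundary on its own.
    """
    proc = subprocess.run(
        [sys.executable, "-I", "-c", generated_code + "\n" + RUNNER],
        input=json.dumps(rows),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise if the generated code fails
    )
    return json.loads(proc.stdout)


extracted = [
    {"Item": "bolt", "Qty": "10", "Price": "0.50"},
    {"Item": "nut ", "Qty": "5", "Price": "0.10"},
]
mapped = run_mapping(GENERATED_CODE, extracted)
mapped_again = run_mapping(GENERATED_CODE, extracted)
```

Once the mapping code exists, rerunning it on the same rows produces the same result every time, which is the property the summary attributes to moving transformations out of the LLM and into code.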