Hasty Briefs

25 Years of Eggs

2 days ago
  • #AI
  • #Data Extraction
  • #Receipt Analysis
  • The author has scanned receipts since 2001, waiting for technology to catch up enough to extract the data.
  • Used AI coding agents (Codex and Claude) to process 11,345 receipts over 14 days, consuming 1.6 billion tokens.
  • Receipt segmentation was difficult because of the 'shades of white' problem; Meta’s SAM3 solved it.
  • Discovered that Claude could read receipts perfectly without needing a rotation pipeline.
  • Replaced Tesseract with PaddleOCR-VL for better OCR results, handling tall receipts by dynamic slicing.
  • Structured extraction moved from regex to Codex/Claude, improving accuracy and efficiency.
  • Built a classifier from hand-labeled data; it reached 99%+ accuracy and outperformed the ground-truth labels.
  • The final data was 96% correct, with most errors coming from garbled OCR on old scans.
  • Total egg spend over 25 years was $1,972 for 8,604 eggs across 589 receipts.
  • Combined specialized models (SAM3, PaddleOCR, Codex, Claude) for optimal results.
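The summary mentions handling tall receipts by dynamic slicing but does not say how the slices were chosen. The following is a minimal sketch of one plausible approach, not the author's pipeline: it splits an image of a given pixel height into overlapping vertical windows so that a text line cut by one boundary appears whole in a neighboring slice. The function name `slice_bounds` and the parameters `max_slice` and `overlap` (with their default values) are assumptions made for illustration.

```python
def slice_bounds(height: int, max_slice: int = 2000, overlap: int = 200) -> list[tuple[int, int]]:
    """Return (top, bottom) pixel-row ranges that together cover `height` rows.

    Hypothetical helper: the slice size and overlap are illustrative defaults,
    not values taken from the original article.
    """
    if height <= 0:
        raise ValueError("height must be positive")
    if not 0 <= overlap < max_slice:
        raise ValueError("overlap must satisfy 0 <= overlap < max_slice")

    # Short receipts fit in a single slice.
    if height <= max_slice:
        return [(0, height)]

    step = max_slice - overlap
    bounds = []
    top = 0
    # Emit full-size slices while another slice is still needed below.
    while top + max_slice < height:
        bounds.append((top, top + max_slice))
        top += step

    # Anchor the last slice to the bottom edge so it is also full size.
    bounds.append((height - max_slice, height))
    return bounds


print(slice_bounds(5000))  # [(0, 2000), (1800, 3800), (3000, 5000)]
```

Each range could then be cropped and passed to the OCR model separately, with duplicate lines from the overlap regions merged afterwards. Anchoring the final slice to the bottom edge keeps every slice the same height, at the cost of a larger overlap on the last one.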
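The headline figures above imply a few per-unit rates that are easy to derive. The short calculation below uses only numbers quoted in the summary. The token count is a rounded figure, so the per-receipt value is approximate, and the egg-receipt share assumes the 589 egg receipts are a subset of the 11,345 processed receipts, which the summary does not state explicitly.

```python
# Figures quoted in the summary.
receipts_total = 11_345
days = 14
tokens = 1_600_000_000      # "1.6 billion", a rounded figure
egg_spend_usd = 1_972
eggs = 8_604
egg_receipts = 589

tokens_per_receipt = tokens / receipts_total      # about 141,000 tokens
receipts_per_day = receipts_total / days          # about 810 receipts
cost_per_egg = egg_spend_usd / eggs               # about $0.23
eggs_per_receipt = eggs / egg_receipts            # about 14.6 eggs
# Assumption: egg receipts are counted among the processed receipts.
egg_receipt_share = egg_receipts / receipts_total  # about 5.2%

print(f"tokens per receipt:  {tokens_per_receipt:,.0f}")
print(f"receipts per day:    {receipts_per_day:.0f}")
print(f"cost per egg:        ${cost_per_egg:.3f}")
print(f"eggs per egg receipt: {eggs_per_receipt:.1f}")
print(f"egg receipt share:   {egg_receipt_share:.1%}")
```

In other words, the 25-year average works out to roughly 23 cents per egg, a little over a dozen eggs per egg-bearing receipt.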