25 Years of Eggs
2 days ago
- #AI
- #Data Extraction
- #Receipt Analysis
- The author has been scanning receipts since 2001, waiting for technology to catch up enough to extract the data.
- Used AI coding agents (Codex and Claude) to process 11,345 receipts over 14 days, consuming 1.6 billion tokens.
- Faced challenges with receipt segmentation due to the 'shades of white' problem, which was solved using Meta’s SAM3.
- Discovered that Claude could read receipts perfectly without needing a rotation pipeline.
- Replaced Tesseract with PaddleOCR-VL for better OCR results, handling tall receipts by dynamic slicing.
- Structured extraction evolved from regular expressions to Codex/Claude, improving both accuracy and efficiency.
- Built a classifier from hand-labeled data; it reached 99%+ accuracy and proved more reliable than the original ground-truth labels.
- The final extracted data was 96% correct, with most errors coming from garbled OCR on old scans.
- Total egg spend over 25 years was $1,972 for 8,604 eggs across 589 receipts.
- The best results came from combining specialized models (SAM3, PaddleOCR, Codex, Claude).
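
The headline figures above imply a few per-unit rates that are easy to work out. The sketch below is only arithmetic on the numbers quoted in this summary; the variable names and the `ratio` helper are illustrative and not taken from the author's code. Because the inputs are rounded in the source (for example, "1.6 billion" tokens and whole-dollar spend), the results should be read as approximations.

```python
# Figures quoted in the summary above.
RECEIPTS_PROCESSED = 11_345
PROCESSING_DAYS = 14
TOKENS_USED = 1_600_000_000  # "1.6 billion", rounded in the source
EGG_SPEND_USD = 1_972        # whole dollars, rounded in the source
EGGS_BOUGHT = 8_604
EGG_RECEIPTS = 589


def ratio(numerator: float, denominator: float) -> float:
    """Divide, failing loudly instead of dividing by zero."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# Derived per-unit rates (hypothetical names, for illustration only).
cost_per_egg = ratio(EGG_SPEND_USD, EGGS_BOUGHT)
eggs_per_egg_receipt = ratio(EGGS_BOUGHT, EGG_RECEIPTS)
receipts_per_day = ratio(RECEIPTS_PROCESSED, PROCESSING_DAYS)
tokens_per_receipt = ratio(TOKENS_USED, RECEIPTS_PROCESSED)
egg_receipt_share = ratio(EGG_RECEIPTS, RECEIPTS_PROCESSED)

print(f"Average cost per egg:       ${cost_per_egg:.2f}")
print(f"Eggs per egg receipt:       {eggs_per_egg_receipt:.1f}")
print(f"Receipts processed per day: {receipts_per_day:.0f}")
print(f"Tokens per receipt:         {tokens_per_receipt:,.0f}")
print(f"Receipts containing eggs:   {egg_receipt_share:.1%}")
```

On these numbers the average works out to roughly 23 cents per egg, a little more than a dozen eggs per egg-bearing receipt, and on the order of 140,000 tokens spent per receipt processed.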