A PDF that changes based on who is reading
- #PDF Format
- #Accessibility
- #Document Structure
- PDF is a visual format often lacking structural tags, making machine extraction and interpretation challenging.
- A 'Smart PDF' technique uses the PDF spec's replacement text property (ActualText) to embed structured Markdown alongside the visual content.
- Extractors that support the property return clean Markdown with headings, lists, and tables, while renderers still display the original visual layout.
- Benchmarks show token counts remain similar, but structured markdown increases information density per token for LLMs.
- This creates adaptive documents: humans see formatted PDFs, machines extract structured markdown from the same file.
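
To make the mechanism concrete, here is a minimal sketch of the idea, not the article's actual implementation. It assumes the replacement text property is the `/ActualText` entry on a marked-content span (`BDC` ... `EMC`) in a page content stream. The helper names (`wrap_with_actual_text`, `extract_text`) are hypothetical, the "extractor" is a toy regex rather than a real PDF parser, and only ASCII text is handled (a real file would also need proper string encoding, fonts, and page objects).

```python
import re


def escape_pdf_string(text: str) -> str:
    """Escape text for use inside a PDF literal string: ( ... )."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
        .replace("\n", "\\n")
    )


def unescape_pdf_string(text: str) -> str:
    """Reverse escape_pdf_string (only the escapes produced above)."""
    return re.sub(
        r"\\(.)",
        lambda m: "\n" if m.group(1) == "n" else m.group(1),
        text,
        flags=re.S,
    )


def wrap_with_actual_text(visual_ops: str, markdown: str) -> str:
    """Wrap drawing operators in a span whose /ActualText holds the Markdown."""
    escaped = escape_pdf_string(markdown)
    return f"/Span << /ActualText ({escaped}) >> BDC\n{visual_ops}\nEMC"


# Toy patterns: a PDF literal string, a span carrying /ActualText,
# and a string shown on the page with the Tj operator.
STRING = r"\(((?:\\.|[^\\()])*)\)"
SPAN_RE = re.compile(
    r"/Span\s*<<\s*/ActualText\s*" + STRING + r"\s*>>\s*BDC(.*?)EMC", re.S
)
SHOWN_RE = re.compile(STRING + r"\s*Tj")


def shown_strings(fragment: str) -> list[str]:
    """Return the strings a renderer would paint via Tj in this fragment."""
    return [unescape_pdf_string(s) for s in SHOWN_RE.findall(fragment)]


def extract_text(stream: str, honor_actual_text: bool) -> str:
    """Toy extractor: optionally substitute /ActualText for the painted text."""
    if not honor_actual_text:
        return "\n".join(shown_strings(stream))
    parts, pos = [], 0
    for span in SPAN_RE.finditer(stream):
        parts.extend(shown_strings(stream[pos:span.start()]))
        parts.append(unescape_pdf_string(span.group(1)))  # replacement text wins
        pos = span.end()
    parts.extend(shown_strings(stream[pos:]))
    return "\n".join(parts)


if __name__ == "__main__":
    markdown = "## Results\n\n- Latency: 12 ms\n- Cost (est.): $3"
    visual = (
        "BT /F1 14 Tf 72 700 Td (Results) Tj ET\n"
        "BT /F1 10 Tf 72 680 Td (Latency: 12 ms) Tj ET"
    )
    stream = wrap_with_actual_text(visual, markdown)
    print(extract_text(stream, honor_actual_text=False))  # what is painted
    print(extract_text(stream, honor_actual_text=True))   # what a machine reads
```

The drawing operators inside the span are untouched, so a renderer paints the same page either way. Only an extractor that honors the property swaps in the Markdown, which is why one file can serve both audiences.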