Zpdf: PDF text extraction in Zig – 5x faster than MuPDF
3 months ago
- #Zig
- #Text Extraction
- A PDF text extraction library written in Zig.
- Features memory-mapped file reading for efficient large file handling.
- Supports streaming text extraction without intermediate allocations.
- Includes multiple decompression filters: FlateDecode, ASCII85, ASCIIHex, LZW, RunLength.
- Font encoding support includes WinAnsi, MacRoman, and ToUnicode CMap.
- Parses both classic XRef tables and XRef streams (PDF 1.5+).
- Offers configurable error handling (strict or permissive).
- Enables multi-threaded parallel page extraction.
- Benchmarks show speedups of about 5x over MuPDF.
- Peak throughput: 41,000 pages/sec (Intel SDM document, parallel extraction).
- Build with 'zig build -Doptimize=ReleaseFast' for optimal results.
- Includes CLI tools for text extraction, document info, and benchmarking.
- Library structure includes modules for parsing, decompression, encoding, and more.
- Implemented features cover XRef parsing, incremental PDF updates, and CID font handling.
- Licensed under MIT.
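
To make the decompression filters listed above more concrete, here is a minimal sketch of how a PDF stream's filter chain can be undone. It is written in Python for illustration only and does not reflect zpdf's Zig API. The names `decode_stream`, `FILTERS`, and the helper functions are hypothetical. LZW is left out because the Python standard library has no decoder for it.

```python
import base64
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """ASCIIHexDecode: hex digits, whitespace ignored, '>' marks end of data."""
    body = data.split(b">", 1)[0]
    digits = b"".join(body.split())  # drop all ASCII whitespace
    if len(digits) % 2:
        digits += b"0"  # a trailing odd digit is treated as if followed by 0
    return bytes.fromhex(digits.decode("ascii"))


def ascii85_decode(data: bytes) -> bytes:
    """ASCII85Decode: PDF streams end with '~>' and usually omit the '<~' prefix."""
    data = data.strip()
    if not data.endswith(b"~>"):
        data += b"~>"
    return base64.a85decode(data, adobe=True)


def run_length_decode(data: bytes) -> bytes:
    """RunLengthDecode: 0..127 copies the next n+1 bytes, 129..255 repeats one byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:
            out += data[i:i + length + 1]
            i += length + 1
        else:
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Hypothetical dispatch table keyed by the filter names used in a stream's /Filter entry.
FILTERS = {
    "FlateDecode": zlib.decompress,
    "ASCII85Decode": ascii85_decode,
    "ASCIIHexDecode": ascii_hex_decode,
    "RunLengthDecode": run_length_decode,
}


def decode_stream(data: bytes, filters: list) -> bytes:
    """Apply each filter in the order it is listed in the /Filter array."""
    for name in filters:
        data = FILTERS[name](data)
    return data
```

For example, a content stream that was deflated and then ASCII85-encoded would be decoded with `decode_stream(raw, ["ASCII85Decode", "FlateDecode"])`. A streaming implementation such as the one the summary describes would avoid the intermediate buffers this sketch creates between filters.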
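
The XRef parsing mentioned in the list can be sketched as follows for the classic table form: locate the `startxref` pointer near the end of the file, then read the subsections of the table it points to. This is an illustrative Python sketch under simplifying assumptions (one table, no incremental updates, no XRef streams), not zpdf's implementation. The function names `find_startxref` and `parse_xref_table` are hypothetical.

```python
def find_startxref(pdf: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword."""
    pos = pdf.rfind(b"startxref")
    if pos < 0:
        raise ValueError("no startxref keyword found")
    return int(pdf[pos + len(b"startxref"):].split()[0])


def parse_xref_table(pdf: bytes, offset: int) -> dict:
    """Map object number -> (byte offset, generation) for in-use entries."""
    lines = pdf[offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table (possibly an XRef stream)")
    entries = {}
    i = 1
    while i < len(lines) and lines[i].strip() != b"trailer":
        if not lines[i].strip():  # tolerate blank lines
            i += 1
            continue
        # Subsection header: first object number and entry count.
        first, count = (int(part) for part in lines[i].split()[:2])
        i += 1
        for n in range(count):
            off, gen, kind = lines[i + n].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                entries[first + n] = (int(off), int(gen))
        i += count
    return entries
```

Both functions only use `rfind` and slicing on their input, so they should also work on an `mmap.mmap` object opened over the file, which is one way to avoid reading a large PDF into memory up front.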