Zpdf: PDF text extraction in Zig – 5x faster than MuPDF

4 months ago

A PDF text extraction library written in Zig.
Features memory-mapped file reading for efficient large file handling.
Supports streaming text extraction without intermediate allocations.
Includes multiple decompression filters: FlateDecode, ASCII85, ASCIIHex, LZW, RunLength.
Font encoding support includes WinAnsi, MacRoman, and ToUnicode CMap.
Parses both classic XRef tables and XRef streams (the latter introduced in PDF 1.5).
Offers configurable error handling (strict or permissive).
Enables multi-threaded parallel page extraction.
Benchmarks report roughly 5x faster text extraction than MuPDF.
Peak reported throughput: 41,000 pages/sec (Intel SDM document, parallel extraction).
Build with 'zig build -Doptimize=ReleaseFast' for optimal results.
Includes CLI tools for text extraction, document info, and benchmarking.
Library structure includes modules for parsing, decompression, encoding, and more.
Implemented features cover XRef parsing, incremental PDF updates, and CID font handling.
Licensed under MIT.
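
To give a feel for what the simpler decompression filters listed above involve, here is a minimal Python sketch of ASCIIHex and RunLength decoding as the PDF specification describes them. It is not zpdf's code (which is written in Zig); the function names ascii_hex_decode and run_length_decode are hypothetical, and this version always raises on malformed input rather than offering a permissive mode.

```python
# Illustrative sketch only: not taken from zpdf. Function names are hypothetical.

PDF_WHITESPACE = b"\x00\t\n\x0c\r "      # whitespace bytes defined by the PDF spec
HEX_DIGITS = b"0123456789abcdefABCDEF"


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: whitespace is skipped, '>' ends the data."""
    digits = bytearray()
    for byte in data:
        if byte == ord(">"):             # end-of-data marker
            break
        if byte in PDF_WHITESPACE:
            continue
        if byte not in HEX_DIGITS:
            raise ValueError(f"invalid hex digit: {chr(byte)!r}")
        digits.append(byte)
    if len(digits) % 2:
        digits.append(ord("0"))          # an odd final digit is treated as followed by 0
    return bytes.fromhex(digits.decode("ascii"))


def run_length_decode(data: bytes) -> bytes:
    """Decode RunLengthDecode data: a length byte followed by literal or repeated bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:                # end-of-data marker
            break
        if length < 128:                 # copy the next length + 1 bytes literally
            count = length + 1
            chunk = data[i:i + count]
            if len(chunk) < count:
                raise ValueError("truncated literal run")
            out += chunk
            i += count
        else:                            # repeat the next byte 257 - length times
            if i >= len(data):
                raise ValueError("truncated repeat run")
            out += bytes([data[i]]) * (257 - length)
            i += 1
    return bytes(out)
```

For example, ascii_hex_decode(b"48 65 6C 6c 6F>") yields b"Hello". A strict/permissive switch like the one the library advertises could be layered on by returning the partial output instead of raising.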
