Zpdf: PDF text extraction in Zig – 5x faster than MuPDF

3 months ago
  • #PDF
  • #Zig
  • #Text Extraction
  • A PDF text extraction library written in Zig.
  • Features memory-mapped file reading for efficient large file handling.
  • Supports streaming text extraction without intermediate allocations.
  • Includes multiple decompression filters: FlateDecode, ASCII85Decode, ASCIIHexDecode, LZWDecode, RunLengthDecode.
  • Font encoding support includes WinAnsi, MacRoman, and ToUnicode CMap.
  • Parses both classic XRef tables and XRef streams (the latter introduced in PDF 1.5).
  • Offers configurable error handling (strict or permissive).
  • Enables multi-threaded parallel page extraction.
  • Benchmarks show text extraction about 5x faster than MuPDF.
  • Peak throughput: 41,000 pages/sec, measured on the Intel SDM (Software Developer's Manual) PDF with parallel extraction.
  • Build with 'zig build -Doptimize=ReleaseFast' for optimal results.
  • Includes CLI tools for text extraction, document info, and benchmarking.
  • Library structure includes modules for parsing, decompression, encoding, and more.
  • Implemented features cover XRef parsing, incremental PDF updates, and CID font handling.
  • Licensed under MIT.
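
To make the decompression-filter bullet concrete, here is a minimal Python sketch of how three of the listed filters behave, following the filter definitions in the PDF specification. It is an illustration only, not zpdf's Zig code or API: the names `ascii_hex_decode`, `run_length_decode`, `FILTERS`, and `apply_filters` are hypothetical. The ASCII85 and LZW filters and the Flate predictor parameters are left out to keep the sketch short.

```python
import zlib

# Whitespace bytes as defined by the PDF specification: NUL, TAB, LF, FF, CR, SPACE.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def ascii_hex_decode(data: bytes) -> bytes:
    """ASCIIHexDecode: pairs of hex digits, whitespace ignored, '>' ends the data."""
    digits = bytearray()
    for b in data:
        if b == ord(">"):
            break
        if b in PDF_WHITESPACE:
            continue
        digits.append(b)
    if len(digits) % 2:
        digits.append(ord("0"))  # an odd trailing digit is treated as if followed by 0
    return bytes.fromhex(digits.decode("ascii"))


def run_length_decode(data: bytes) -> bytes:
    """RunLengthDecode: a length byte followed by a literal run or a repeated byte."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:  # end-of-data marker
            break
        if length < 128:  # copy the next length + 1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:  # repeat the next byte 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)


# Hypothetical dispatch table keyed by the PDF filter name.
FILTERS = {
    "FlateDecode": zlib.decompress,  # zlib-wrapped deflate data, predictors ignored
    "ASCIIHexDecode": ascii_hex_decode,
    "RunLengthDecode": run_length_decode,
}


def apply_filters(data: bytes, names: list[str]) -> bytes:
    """Apply a stream's filters in the order they are listed."""
    for name in names:
        data = FILTERS[name](data)
    return data


if __name__ == "__main__":
    content = b"BT (Hi) Tj ET"
    encoded = zlib.compress(content).hex().encode("ascii") + b">"
    print(apply_filters(encoded, ["ASCIIHexDecode", "FlateDecode"]))
```

A stream can name several filters, and they are applied in order, which is why the sketch routes everything through one dispatch loop. A streaming extractor would more likely decode incrementally than build whole buffers as done here.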