simdxml: structural indexing for XML

a day ago
  • #XML
  • #parsing
  • #SIMD
  • simdxml applies SIMD techniques to XML parsing, inspired by simdjson, to speed up processing.
  • The library uses a two-pass approach: SIMD classification of structural characters followed by sequential index-building.
  • XML's complexity, including two quote styles (single and double, each of which may appear inside the other) and attribute-heavy structures, makes SIMD parsing more challenging than JSON.
  • simdxml employs quote masking with carry-less multiplication to handle XML's quoting rules efficiently.
  • For text-heavy XML, simdxml uses a heuristic to skip SIMD processing when it's not beneficial.
  • The library stores parsed data in flat arrays for cache efficiency and lower memory usage compared to DOM-based parsers.
  • XPath queries are optimized using pre/post-order numbering and inverted posting lists for fast traversal.
  • Parallel parsing is implemented for large XML files by identifying safe split points between elements.
  • simdxml includes features like Bloom filter prescan and lazy index construction to optimize query performance.
  • Benchmarks show simdxml outperforming pugixml and xmllint in parsing and XPath evaluation, especially for large files.
  • The library is available as a Rust crate, Python package, DuckDB extension, and CLI tool.
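
The quote-masking idea can be illustrated with a small sketch. Carry-less multiplication of a quote bitmask by an all-ones word yields a prefix XOR, which marks every byte that lies between an opening and a closing quote. The Python below emulates that prefix XOR with shifts on a 64-bit word for a single quote character; the function names are hypothetical, and how simdxml itself combines single and double quotes is not described in this summary.

```python
MASK64 = (1 << 64) - 1

def quote_bits(block: bytes, quote: int = ord('"')) -> int:
    """Return a bitmask where bit i is set when block[i] is the quote byte."""
    bits = 0
    for i, b in enumerate(block[:64]):
        if b == quote:
            bits |= 1 << i
    return bits

def prefix_xor(bits: int) -> int:
    """Bit i of the result is the XOR of input bits 0..i.

    This is what a carry-less multiply by all-ones computes in the low
    64 bits; here it is emulated with log2(64) shift-and-XOR steps.
    """
    for shift in (1, 2, 4, 8, 16, 32):
        bits ^= (bits << shift) & MASK64
    return bits

def inside_quotes(block: bytes) -> int:
    """Mask covering each opening quote up to (not including) its closing quote."""
    return prefix_xor(quote_bits(block))

# Assumed example input: the '>' inside the href value must not end the tag.
block = b'<a href="x>y" id="1">'
mask = inside_quotes(block)
structural_gt = [
    i for i, b in enumerate(block)
    if b == ord(">") and not (mask >> i) & 1
]
print(structural_gt)
```

Only the final `>` survives the mask, so the `>` inside the attribute value is not treated as structural.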
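
As a rough illustration of flat-array storage with pre/post-order numbering, the sketch below numbers elements from a stream of open/close events and answers ancestor queries with two integer comparisons. The event format and function names are assumptions made for this example, not simdxml's actual API or layout.

```python
def number_nodes(events):
    """Build flat arrays (names, parent, pre, post) from (kind, name) events.

    kind is "open" or "close". Pre-order numbers follow the order of open
    tags; post-order numbers follow the order of close tags.
    """
    names, parent, pre, post = [], [], [], []
    stack = []
    post_counter = 0
    for kind, name in events:
        if kind == "open":
            node = len(names)
            names.append(name)
            parent.append(stack[-1] if stack else -1)
            pre.append(node)
            post.append(-1)  # filled in when the element closes
            stack.append(node)
        else:
            node = stack.pop()
            post[node] = post_counter
            post_counter += 1
    return names, parent, pre, post

def is_descendant(pre, post, ancestor, node):
    """node is below ancestor iff it opens later and closes earlier."""
    return pre[ancestor] < pre[node] and post[node] < post[ancestor]

# Events for the assumed document <a><b><c/></b><d/></a>
events = [
    ("open", "a"), ("open", "b"), ("open", "c"), ("close", "c"),
    ("close", "b"), ("open", "d"), ("close", "d"), ("close", "a"),
]
names, parent, pre, post = number_nodes(events)
print(is_descendant(pre, post, 0, 2))  # is c under a?
print(is_descendant(pre, post, 1, 3))  # is d under b?
```

Because the arrays are indexed by node number, an axis test such as descendant needs no pointer chasing, which is the cache-friendliness argument the summary makes against DOM-style trees.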