simdxml: structural indexing for XML

a day ago

simdxml applies SIMD techniques to XML parsing, inspired by simdjson, to speed up processing.
The library uses a two-pass approach: SIMD classification of structural characters followed by sequential index-building.
XML's complexity, including nested quotes and attribute-heavy structures, makes SIMD parsing more challenging than JSON.
simdxml employs quote masking with carry-less multiplication to handle XML's quoting rules efficiently.
For text-heavy XML, simdxml uses a heuristic to skip SIMD processing when it's not beneficial.
The library stores parsed data in flat arrays for cache efficiency and lower memory usage compared to DOM-based parsers.
XPath queries are optimized using pre/post-order numbering and inverted posting lists for fast traversal.
Parallel parsing is implemented for large XML files by identifying safe split points between elements.
simdxml includes features like Bloom filter prescan and lazy index construction to optimize query performance.
Benchmarks show simdxml outperforming pugixml and xmllint in parsing and XPath evaluation, especially for large files.
The library is available as a Rust crate, Python package, DuckDB extension, and CLI tool.
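
The quote-masking point above can be illustrated with a minimal sketch. It is not simdxml's actual code: the function names are hypothetical, the block is limited to 64 bytes with no carry between blocks, only double quotes are handled, and comments and CDATA are ignored. A shift-and-XOR prefix XOR stands in for the carry-less multiplication, since multiplying the quote bitmask by an all-ones word without carries yields the same running parity in its low 64 bits.

```rust
/// Prefix XOR over a 64-bit word: bit i of the result is the XOR of
/// bits 0..=i of `x`. A carry-less multiply of `x` by an all-ones word
/// gives the same low 64 bits; this portable version uses shifts instead.
fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Bitmask with bit i set when block[i] == target.
/// A SIMD implementation would get this from a vector compare plus movemask;
/// the scalar loop here is only for illustration.
fn byte_mask(block: &[u8], target: u8) -> u64 {
    assert!(block.len() <= 64, "sketch handles a single 64-byte block");
    block.iter().enumerate().fold(0u64, |mask, (i, &b)| {
        if b == target { mask | (1u64 << i) } else { mask }
    })
}

/// Bits set from each opening double quote up to, but not including,
/// its closing quote. Assumption: only `"` delimits attribute values.
fn in_quote_mask(block: &[u8]) -> u64 {
    prefix_xor(byte_mask(block, b'"'))
}

/// Positions of `<` and `>` that are not inside a quoted attribute value.
fn structural_mask(block: &[u8]) -> u64 {
    let candidates = byte_mask(block, b'<') | byte_mask(block, b'>');
    candidates & !in_quote_mask(block)
}

fn main() {
    // The `>` inside the attribute value must not count as structural.
    let xml = br#"<a href="x>y"><b/></a>"#;
    let mask = structural_mask(xml);
    for (i, &b) in xml.iter().enumerate() {
        if mask & (1u64 << i) != 0 {
            println!("{:2}: {}", i, b as char);
        }
    }
}
```

For the sample input, the `>` at byte 10 sits inside the quoted value and is masked out, while the tag delimiters at bytes 0, 13, 14, 17, 18, and 21 survive. A real XML parser additionally has to track single-quoted values and carry the in-quote state from one block to the next.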
