simdxml: structural indexing for XML
a day ago
- #XML
- #parsing
- #SIMD
- Inspired by simdjson, simdxml applies SIMD techniques to XML parsing to speed up processing.
- The library uses a two-pass approach: SIMD classification of structural characters followed by sequential index-building.
- XML's complexity, including nested quotes and attribute-heavy structures, makes SIMD parsing more challenging for XML than for JSON.
- simdxml employs quote masking with carry-less multiplication to handle XML's quoting rules efficiently.
- For text-heavy XML, simdxml uses a heuristic to skip SIMD processing when it's not beneficial.
- The library stores parsed data in flat arrays, which improves cache efficiency and uses less memory than DOM-based parsers.
- XPath queries are optimized using pre/post-order numbering and inverted posting lists for fast traversal.
- Parallel parsing is implemented for large XML files by identifying safe split points between elements.
- simdxml includes features like Bloom filter prescan and lazy index construction to optimize query performance.
- Benchmarks show simdxml outperforming pugixml and xmllint in parsing and XPath evaluation, especially for large files.
- The library is available as a Rust crate, Python package, DuckDB extension, and CLI tool.
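
The quote-masking step mentioned in the summary can be illustrated with a minimal scalar sketch. It is not simdxml's actual code: the function names (`byte_mask`, `prefix_xor`, `structural_gt_mask`) are hypothetical, the chunk is classified byte by byte instead of with SIMD compares, and only double quotes are handled (single quotes, comments, and CDATA are ignored). The prefix XOR computed here with shifts is the same value a carry-less multiplication of the quote bitmask by an all-ones word would produce in its low 64 bits.

```rust
/// Bit i of the result is set when byte i of `chunk` equals `needle`.
/// A real implementation would do this with SIMD compares; only the
/// first 64 bytes of the chunk are considered.
fn byte_mask(chunk: &[u8], needle: u8) -> u64 {
    chunk
        .iter()
        .take(64)
        .enumerate()
        .fold(0u64, |m, (i, &b)| m | (((b == needle) as u64) << i))
}

/// Prefix XOR: bit i of the result is the XOR of bits 0..=i of `x`.
/// Equivalent to the low 64 bits of a carry-less multiply by all-ones.
fn prefix_xor(mut x: u64) -> u64 {
    x ^= x << 1;
    x ^= x << 2;
    x ^= x << 4;
    x ^= x << 8;
    x ^= x << 16;
    x ^= x << 32;
    x
}

/// Positions of '>' that are NOT inside a double-quoted attribute value.
/// Assumption: the chunk starts outside any quoted region.
fn structural_gt_mask(chunk: &[u8]) -> u64 {
    let quotes = byte_mask(chunk, b'"');
    // Set from each opening quote up to (not including) its closing quote.
    let in_quotes = prefix_xor(quotes);
    byte_mask(chunk, b'>') & !in_quotes
}
```

For the input `<a b="x>y"><c/>`, the `>` at byte 7 sits inside the attribute value and is masked out, leaving only the tag-closing `>` characters at bytes 10 and 14. A production parser would also need to carry the "inside quotes" state across 64-byte chunks.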
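
The pre/post-order numbering mentioned for XPath can also be sketched briefly. This is a generic illustration of the technique under stated assumptions, not the library's data layout: `Interval`, `number_nodes`, and `is_ancestor` are hypothetical names, and the input is simplified to a stream of open (`true`) and close (`false`) tag events. Each element gets a pre-order rank when it opens and a post-order rank when it closes, stored in a flat vector, so an ancestor check becomes two integer comparisons with no tree walk.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    pre: u32,  // rank in document (open-tag) order
    post: u32, // rank in close-tag order
}

/// Assign pre/post numbers from a well-nested stream of tag events,
/// where `true` is an opening tag and `false` is a closing tag.
/// Nodes are stored flat, indexed by their pre-order rank.
fn number_nodes(events: &[bool]) -> Vec<Interval> {
    let mut nodes: Vec<Interval> = Vec::new();
    let mut open: Vec<usize> = Vec::new(); // indices of currently open nodes
    let mut next_post = 0u32;
    for &is_open in events {
        if is_open {
            let pre = nodes.len() as u32;
            open.push(nodes.len());
            nodes.push(Interval { pre, post: 0 });
        } else if let Some(i) = open.pop() {
            nodes[i].post = next_post;
            next_post += 1;
        }
    }
    nodes
}

/// `a` is a proper ancestor of `d` when it opens earlier and closes later.
fn is_ancestor(a: Interval, d: Interval) -> bool {
    a.pre < d.pre && a.post > d.post
}
```

For `<a><b/><c><d/></c></a>` the event stream is open, open, close, open, open, close, close, close. Under this numbering `a` and `c` are ancestors of `d`, while `b` is not, because `b` closes before `d` does.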