Let's See Paul Allen's SIMD CSV Parser
a day ago
- #CSV Parsing
- #Performance Optimization
- #SIMD
- SIMD (Single Instruction, Multiple Data) applies one instruction to many data elements at once; parsers built on it also gain speed by replacing data-dependent branches with mask operations.
- The simdjson paper is a key resource for understanding SIMD parsing techniques; it describes a JSON parser that processes its input 64 bytes at a time.
- CSV parsing involves classifying structural characters (commas, quotes, newlines), filtering out those within quoted fields, and then using the remaining delimiters to split the data into rows and fields.
- Vectorized classification looks up each byte's nibbles (the 4-bit halves of a byte) in small tables, classifying every character in a chunk without branching.
- Bitmasking compresses the classified bytes into one bitmask per structural character type, with one bit per input byte, so a 64-byte chunk is reduced to a few 64-bit integers.
- Filtering 'fake' delimiters uses a prefix XOR on the quote bitmask: each bit of the result tells whether that position is inside or outside a quoted field, and delimiters inside quotes are then masked out.
- The final step collects field and row boundaries by iterating over the set bits of the cleaned bitmasks, using bit-scan operations such as counting leading (or trailing) zeros, depending on the mask's bit order, to find each delimiter position.
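
The steps above can be sketched in scalar Python, with plain integers standing in for 64-bit SIMD masks. This is a minimal illustration under stated assumptions, not the parser the post describes: the names (`classify`, `build_masks`, `prefix_xor`, `find_delimiters`, `parse`) and the table layout are hypothetical, the per-byte loop stands in for what a vector shuffle and movemask would do in one step, bit `i` of a mask is assumed to correspond to byte `i` of the chunk (so positions are found with a trailing-zero count), and fields are returned with their surrounding quotes intact.

```python
MASK64 = (1 << 64) - 1
COMMA, QUOTE, NEWLINE = 1, 2, 4

# Nibble tables (assumed layout): a byte belongs to a class only if BOTH
# its low-nibble and high-nibble entries carry that class bit.
LO = [0] * 16
HI = [0] * 16
LO[0xC] = COMMA        # ',' is 0x2C
LO[0x2] = QUOTE        # '"' is 0x22
LO[0xA] = NEWLINE      # '\n' is 0x0A
HI[0x2] = COMMA | QUOTE
HI[0x0] = NEWLINE


def classify(byte):
    """Branch-free classification: two table lookups and an AND."""
    return LO[byte & 0xF] & HI[byte >> 4]


def build_masks(chunk):
    """Compress up to 64 bytes into (commas, quotes, newlines) bitmasks."""
    commas = quotes = newlines = 0
    for i, byte in enumerate(chunk):
        cls = classify(byte)
        commas |= (cls & COMMA) << i
        quotes |= ((cls & QUOTE) >> 1) << i
        newlines |= ((cls & NEWLINE) >> 2) << i
    return commas, quotes, newlines


def prefix_xor(mask):
    """Bit i of the result is the XOR of bits 0..i of mask."""
    for shift in (1, 2, 4, 8, 16, 32):
        mask ^= (mask << shift) & MASK64
    return mask


def find_delimiters(data):
    """Yield (offset, is_newline) for each delimiter outside quotes."""
    carry = 0  # all ones if the previous chunk ended inside a quoted field
    for base in range(0, len(data), 64):
        commas, quotes, newlines = build_masks(data[base:base + 64])
        inside = prefix_xor(quotes) ^ carry
        carry = MASK64 if (inside >> 63) & 1 else 0
        real = (commas | newlines) & ~inside & MASK64
        while real:
            lowest = real & -real          # isolate the lowest set bit
            pos = lowest.bit_length() - 1  # trailing-zero count
            yield base + pos, bool(newlines & lowest)
            real &= real - 1               # clear the lowest set bit


def parse(data):
    """Split data into rows of raw fields (quotes are left in place)."""
    rows, row, start = [], [], 0
    for pos, is_newline in find_delimiters(data):
        row.append(data[start:pos])
        start = pos + 1
        if is_newline:
            rows.append(row)
            row = []
    if start < len(data) or row:  # final row without a trailing newline
        row.append(data[start:])
        rows.append(row)
    return rows
```

For example, `parse(b'a,"b,c",d\n')` keeps the comma inside the quoted field and splits only on the other two. A doubled quote (`""`) inside a quoted field toggles the inside/outside state twice, so it needs no special handling at this stage.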