Let's See Paul Allen's SIMD CSV Parser

a day ago

SIMD (Single Instruction, Multiple Data) applies one instruction to many data elements at once. Parsers built on it gain speed both from that data parallelism and from replacing per-byte branches with branch-free bit operations.
The simdjson paper is a key resource for understanding SIMD parsing techniques. It describes a JSON parser that processes its input 64 bytes at a time.
CSV parsing involves classifying structural characters (commas, quotes, newlines), filtering out those within quoted fields, and then using the remaining delimiters to split the data into rows and fields.
Vectorized classification uses lookup tables based on nibbles (4-bit segments of a byte) to classify characters efficiently without branching.
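
The nibble lookup can be illustrated with a scalar emulation. This is only a sketch: the class bit values (`COMMA`, `QUOTE`, `NEWLINE`), the table names, and the `classify` function are assumptions made for illustration, and a real implementation would perform both table lookups with a byte-shuffle instruction across a whole vector register rather than looping over bytes.

```python
# Hypothetical class bits; any distinct powers of two would work.
COMMA, QUOTE, NEWLINE = 1, 2, 4

# Table indexed by the HIGH nibble of a byte.
# ',' is 0x2C and '"' is 0x22 (high nibble 0x2); '\n' is 0x0A (high nibble 0x0).
HI_TABLE = [0] * 16
HI_TABLE[0x0] = NEWLINE
HI_TABLE[0x2] = COMMA | QUOTE

# Table indexed by the LOW nibble of a byte.
LO_TABLE = [0] * 16
LO_TABLE[0xA] = NEWLINE
LO_TABLE[0x2] = QUOTE
LO_TABLE[0xC] = COMMA


def classify(chunk: bytes) -> list:
    """Return one class value per byte, with no per-character comparisons.

    A byte belongs to a class only if BOTH of its nibbles agree, so the two
    table lookups are combined with a bitwise AND.
    """
    return [HI_TABLE[b >> 4] & LO_TABLE[b & 0x0F] for b in chunk]
```

For the input `b'a,"b\n'` this yields `[0, 1, 2, 0, 4]`: only the comma, the quote, and the newline receive a nonzero class.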
Bitmasking compresses classified bytes into bitmasks for each structural character type, enabling efficient processing of large data chunks.
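
As a rough sketch of the result of this step, the function below builds one integer bitmask per structural character for a chunk of up to 64 bytes, with bit `i` set when byte `i` matches. The name `build_masks` and the dictionary layout are assumptions for illustration. SIMD code would obtain each mask from a vector comparison followed by a move-mask style instruction instead of the explicit loop used here.

```python
def build_masks(chunk: bytes) -> dict:
    """Compress a chunk (at most 64 bytes) into one bitmask per character type.

    Bit i of each mask corresponds to byte i of the chunk, so bit 0 is the
    first byte. This bit order is an assumption of the sketch.
    """
    if len(chunk) > 64:
        raise ValueError("this sketch handles one 64-byte chunk at a time")
    masks = {"comma": 0, "quote": 0, "newline": 0}
    for i, b in enumerate(chunk):
        if b == 0x2C:      # ','
            masks["comma"] |= 1 << i
        elif b == 0x22:    # '"'
            masks["quote"] |= 1 << i
        elif b == 0x0A:    # '\n'
            masks["newline"] |= 1 << i
    return masks
```

For `b'a,"b,c"\n'` the comma mask is `0b00010010`, the quote mask is `0b01000100`, and the newline mask is `0b10000000`.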
Filtering 'fake' delimiters involves using a prefix XOR on the quote bitmask to determine if a position is inside or outside a quoted field, then masking out delimiters within quotes.
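
The prefix XOR can be sketched with shifts and XORs on a 64-bit mask. The function names below are hypothetical, and the sketch covers a single chunk only: a real parser would also carry the "inside a quote" state from one chunk into the next, and may compute the prefix XOR with a carry-less multiply instruction instead.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits, as a register would


def prefix_xor(mask: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of the input.

    Applied to a quote mask, the result has bits set from each opening quote
    up to (but not including) the matching closing quote.
    """
    for shift in (1, 2, 4, 8, 16, 32):
        mask ^= (mask << shift) & MASK64
    return mask


def outside_quotes(delimiters: int, quotes: int) -> int:
    """Clear every delimiter bit that falls inside a quoted field."""
    inside = prefix_xor(quotes)
    return delimiters & ~inside & MASK64
```

With the chunk `b'a,"b,c"\n'`, the quote mask `0b01000100` becomes the inside-quote mask `0b00111100`, so the comma mask `0b00010010` is reduced to `0b00000010`: the comma between `b` and `c` is dropped and the one after `a` survives.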
The final step collects field and row boundaries by iterating through cleaned bitmasks, using operations like counting leading zeros to find delimiter positions.
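
A minimal sketch of this collection step follows, with hypothetical function names. It assumes bit 0 of each mask is the first byte of the chunk, so the next delimiter is the lowest set bit and is located with a trailing-zero count; with the opposite bit order, a leading-zero count plays the same role. The sketch leaves quote characters in the fields and ignores any data after the last newline.

```python
def delimiter_positions(mask: int) -> list:
    """Return the indices of the set bits in ascending order."""
    positions = []
    while mask:
        lowest = mask & -mask                       # isolate the lowest set bit
        positions.append(lowest.bit_length() - 1)   # its index (trailing zeros)
        mask &= mask - 1                            # clear that bit
    return positions


def split_fields(chunk: bytes, commas: int, newlines: int) -> list:
    """Split a chunk into rows of fields using already-cleaned bitmasks."""
    rows, row, start = [], [], 0
    for pos in delimiter_positions(commas | newlines):
        row.append(chunk[start:pos])
        start = pos + 1
        if (newlines >> pos) & 1:   # a newline also ends the current row
            rows.append(row)
            row = []
    return rows
```

Given `b'a,"b,c"\n'` with the cleaned comma mask `0b00000010` and newline mask `0b10000000`, `split_fields` returns `[[b'a', b'"b,c"']]`.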
