Hasty Briefs (beta)

Let's See Paul Allen's SIMD CSV Parser

a day ago
  • #CSV Parsing
  • #Performance Optimization
  • #SIMD
  • SIMD (Single Instruction, Multiple Data) applies one instruction to many data elements at once. Parsers built on it gain speed from that data parallelism and from replacing hard-to-predict branches with branch-free bit operations.
  • The simdjson paper is a key resource for understanding SIMD parsing techniques; its JSON parser processes input in 64-byte blocks.
  • CSV parsing involves classifying structural characters (commas, quotes, newlines), filtering out those within quoted fields, and then using the remaining delimiters to split the data into rows and fields.
  • Vectorized classification uses lookup tables based on nibbles (4-bit segments of a byte) to classify characters efficiently without branching.
  • The classified bytes are then compressed into one bitmask per structural character type, with one bit per input byte, so a whole chunk can be processed with a few integer operations.
  • Filtering 'fake' delimiters involves using a prefix XOR on the quote bitmask to determine if a position is inside or outside a quoted field, then masking out delimiters within quotes.
  • The final step collects field and row boundaries by iterating through cleaned bitmasks, using operations like counting leading zeros to find delimiter positions.
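
The steps above can be sketched in plain Python. This is a scalar emulation for illustration only, not the parser's actual code: the names `classify`, `prefix_xor`, `set_bit_positions`, and `find_delimiters` are hypothetical, the nibble table layout is an assumption, and a real implementation would do the table lookups with a SIMD shuffle instruction and typically compute the prefix XOR with a carry-less multiply. The sketch puts the first byte of a chunk in bit 0, so it finds delimiter positions by counting trailing zeros; counting leading zeros plays the same role when the bit order is reversed.

```python
MASK64 = (1 << 64) - 1
COMMA, QUOTE, NEWLINE = 1, 2, 4

# Assumed table layout: a byte belongs to a class only if BOTH its low and
# high nibble tables carry that class bit.
LO = [0] * 16
HI = [0] * 16
LO[0xC] = COMMA            # ',' is 0x2C
LO[0x2] = QUOTE            # '"' is 0x22
LO[0xA] = NEWLINE          # '\n' is 0x0A
HI[0x2] = COMMA | QUOTE
HI[0x0] = NEWLINE


def classify(chunk: bytes):
    """Return (commas, quotes, newlines) bitmasks; bit i describes chunk[i]."""
    commas = quotes = newlines = 0
    for i, b in enumerate(chunk):
        cls = LO[b & 0x0F] & HI[b >> 4]    # branch-free in real SIMD code
        commas |= (cls & COMMA) << i
        quotes |= ((cls & QUOTE) >> 1) << i
        newlines |= ((cls & NEWLINE) >> 2) << i
    return commas, quotes, newlines


def prefix_xor(x: int) -> int:
    """Bit i of the result is the XOR of bits 0..i of x (64-bit wide)."""
    shift = 1
    while shift < 64:
        x ^= (x << shift) & MASK64
        shift *= 2
    return x


def set_bit_positions(mask: int):
    """Yield indices of set bits, lowest first."""
    while mask:
        low = mask & -mask             # isolate the lowest set bit
        yield low.bit_length() - 1     # its index = number of trailing zeros
        mask ^= low                    # clear it and continue


def find_delimiters(data: bytes):
    """Return (field_ends, row_ends): offsets of unquoted commas and newlines."""
    field_ends, row_ends = [], []
    carry = 0                          # 1 if the previous chunk ended inside quotes
    for base in range(0, len(data), 64):
        commas, quotes, newlines = classify(data[base:base + 64])
        inside = prefix_xor(quotes)    # 1-bits mark positions inside a quoted field
        if carry:
            inside ^= MASK64           # continue the quote state across chunks
        carry = inside >> 63
        field_ends += [base + i for i in set_bit_positions(commas & ~inside)]
        row_ends += [base + i for i in set_bit_positions(newlines & ~inside)]
    return field_ends, row_ends
```

For example, `find_delimiters(b'a,"b,c",d\n')` returns `([1, 7], [9])`: the comma at offset 4 sits between the two quotes, so the prefix XOR marks it as inside a quoted field and it is masked out. A doubled quote (`""`) toggles the state twice, so it leaves the inside/outside state unchanged.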