Hasty Briefsbeta

Decoding UTF-8. Part III: Determining Sequence Length – A Lookup Table

8 days ago
  • #UTF-8
  • #branchless
  • #lookup-table
  • The article discusses using a lookup table to determine the length of UTF-8 sequences without branching.
  • UTF-8 lead bytes can be mapped to sequence lengths using a 256-byte lookup table, where each index corresponds to a byte value.
  • One-byte sequences (0x00-0x7F) have a length of 1.
  • Two-byte sequences (0xC0-0xDF) have a length of 2, except for overlong sequences (0xC0-0xC1), which are invalid.
  • Three-byte sequences (0xE0-0xEF) have a length of 3.
  • Four-byte sequences (0xF0-0xF4) have a length of 4, but sequences implying code points above U+10FFFF (0xF5-0xFF) are invalid.
  • Continuation bytes (0x80-0xBF) and invalid lead bytes (0xF8-0xFF) are marked with a length of 0.
  • The lookup table is implemented using a hard-coded array, with designated initializers for ranges (a GNU extension).
  • The assembly code shows efficient access to the lookup table, but it adds a 256-byte array to the binary, potentially affecting caching.
  • Future posts will explore branchless methods without a lookup table.