Decoding UTF-8. Part III: Determining Sequence Length – A Lookup Table
8 days ago
- #UTF-8
- #branchless
- #lookup-table
- The article discusses using a lookup table to determine the length of UTF-8 sequences without branching.
- UTF-8 lead bytes can be mapped to sequence lengths using a 256-byte lookup table, where each index corresponds to a byte value.
- One-byte sequences (0x00-0x7F) have a length of 1.
- Two-byte sequences (0xC0-0xDF) have a length of 2, except for overlong sequences (0xC0-0xC1), which are invalid.
- Three-byte sequences (0xE0-0xEF) have a length of 3.
- Four-byte sequences (0xF0-0xF4) have a length of 4, but sequences implying code points above U+10FFFF (0xF5-0xFF) are invalid.
- Continuation bytes (0x80-0xBF) and invalid lead bytes (0xF8-0xFF) are marked with a length of 0.
- The lookup table is implemented using a hard-coded array, with designated initializers for ranges (a GNU extension).
- The assembly code shows efficient access to the lookup table, but it adds a 256-byte array to the binary, potentially affecting caching.
- Future posts will explore branchless methods without a lookup table.