Decoding UTF-8. Part III: Determining Sequence Length – A Lookup Table

8 days ago

Copy Link

The article discusses using a lookup table to determine the length of UTF-8 sequences without branching.
UTF-8 lead bytes can be mapped to sequence lengths using a 256-byte lookup table, where each index corresponds to a byte value.
One-byte sequences (0x00-0x7F) have a length of 1.
Two-byte sequences (0xC0-0xDF) have a length of 2, except for overlong sequences (0xC0-0xC1), which are invalid.
Three-byte sequences (0xE0-0xEF) have a length of 3.
Four-byte sequences (0xF0-0xF4) have a length of 4, but sequences implying code points above U+10FFFF (0xF5-0xFF) are invalid.
Continuation bytes (0x80-0xBF) and invalid lead bytes (0xF8-0xFF) are marked with a length of 0.
The lookup table is implemented using a hard-coded array, with designated initializers for ranges (a GNU extension).
The assembly code shows efficient access to the lookup table, but it adds a 256-byte array to the binary, potentially affecting caching.
Future posts will explore branchless methods without a lookup table.

Hasty Briefsbeta