Dark Corners of Unicode (2015)
13 days ago
- #programming
- #text-rendering
- #unicode
- Unicode is a complex system designed to represent all human languages, but it's often misunderstood, even by programmers.
- Unicode includes characters beyond ASCII, such as emoji, and uses codepoints to represent them.
- UTF-8 is an encoding that can represent every Unicode codepoint, unlike ASCII, which is limited to 128 characters.
- A single user-perceived character can be composed of multiple codepoints, as with combining diacritical marks or emoji sequences.
- Sorting and comparing text in Unicode is non-trivial due to language-specific rules and normalization issues.
- Unicode normalization can compose or decompose characters into a canonical form, but this doesn't solve all problems, especially with non-Latin scripts.
- Terminal and font rendering of Unicode characters, especially emoji and combining characters, can be inconsistent.
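- JavaScript and other languages that expose strings as UTF-16 code units handle astral plane characters (codepoints above U+FFFF) poorly, because each one occupies two code units (a surrogate pair).
- MySQL's 'utf8' encoding is limited to 3 bytes per character, so it cannot store astral plane characters, which need 4 bytes in UTF-8; 'utf8mb4' is the full UTF-8 encoding.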
- Emoji do not form a single Unicode block; they are scattered across several blocks and defined by usage.
- Unicode includes many interesting and obscure characters, from control pictures to alchemical symbols.
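
Several of the points above (multi-codepoint characters, normalization, and the 4-byte UTF-8 and two-unit UTF-16 cost of astral plane characters) can be checked with a short Python sketch. It is an illustration, not code from the original article: the helper names `nfc_equal`, `utf8_len`, and `utf16_units` are hypothetical, and only the standard-library `unicodedata` module is assumed.

```python
import unicodedata

# One user-perceived character, two different codepoint sequences.
composed = "\u00e9"      # "é" as a single codepoint, U+00E9
decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def utf8_len(s: str) -> int:
    """Hypothetical helper: number of bytes the string occupies in UTF-8."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Hypothetical helper: number of UTF-16 code units (2 bytes each)."""
    return len(s.encode("utf-16-le")) // 2


pile_of_poo = "\U0001F4A9"  # U+1F4A9, an astral plane character

# Raw comparison sees different codepoints; normalized comparison does not.
print(composed == decomposed)            # False
print(nfc_equal(composed, decomposed))   # True
print(len(composed), len(decomposed))    # 1 2

# UTF-8 width grows with the codepoint: ASCII, Latin-1 range, BMP, astral.
print([utf8_len(c) for c in ("A", "\u00e9", "\u20ac", pile_of_poo)])  # [1, 2, 3, 4]

# An astral character is one codepoint but two UTF-16 code units.
print(len(pile_of_poo), utf16_units(pile_of_poo))  # 1 2
```

The last two lines show why a 3-byte-per-character encoding rejects such characters and why a language that counts UTF-16 code units reports a length of 2 for a single emoji.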