Dark Corners of Unicode (2015)

13 days ago

Unicode is a complex standard designed to encode text in all of the world's writing systems, yet it is often misunderstood, even by programmers.
Unicode includes characters beyond ASCII, such as emoji, and uses codepoints to represent them.
UTF-8 is an encoding that can represent every Unicode codepoint, using one to four bytes each, whereas ASCII is limited to 128 characters.
What a reader perceives as a single character can be made up of several codepoints, as with combining diacritical marks or emoji sequences.
Sorting and comparing text in Unicode is non-trivial due to language-specific rules and normalization issues.
Unicode normalization can decompose characters, but this doesn't solve all problems, especially with non-Latin scripts.
Terminal and font rendering of Unicode characters, especially emoji and combining characters, can be inconsistent.
JavaScript and other languages whose strings are built on UTF-16 code units handle Unicode imperfectly, especially astral plane characters (codepoints above U+FFFF).
MySQL's 'utf8' character set stores at most 3 bytes per character, so it cannot hold astral plane characters, which need 4 bytes in UTF-8.
Emoji do not form a single Unicode block; they are scattered across several blocks and defined by usage.
Unicode includes many interesting and obscure characters, from control pictures to alchemical symbols.
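
Several of the points above (codepoints versus bytes, combining marks, normalization, and the astral plane limits in JavaScript and MySQL) can be checked with a short Python sketch. It uses only the standard library. The helper names utf8_len, utf16_units, and fits_three_byte_utf8 are hypothetical, chosen for illustration, and the 3-byte check is only an approximation of what MySQL's 'utf8' accepts, not a description of its implementation.

```python
import unicodedata


def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def utf16_units(s: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's String.length counts."""
    return len(s.encode("utf-16-le")) // 2


def fits_three_byte_utf8(s: str) -> bool:
    """True if no codepoint lies above U+FFFF, i.e. none needs 4 UTF-8 bytes.

    Assumption: this approximates the limit of MySQL's 3-byte 'utf8' charset.
    """
    return all(ord(ch) <= 0xFFFF for ch in s)


# One visible character, two different codepoint sequences.
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT, two codepoints

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(len(composed), len(decomposed))                        # 1 2

# UTF-8 uses between one and four bytes per codepoint.
euro = "\u20ac"              # EURO SIGN, in the Basic Multilingual Plane
pile_of_poo = "\U0001F4A9"   # an astral plane emoji
print(utf8_len("A"), utf8_len(composed), utf8_len(euro), utf8_len(pile_of_poo))  # 1 2 3 4

# An astral plane character is one codepoint but two UTF-16 code units.
print(len(pile_of_poo), utf16_units(pile_of_poo))  # 1 2

# An emoji sequence: man, ZERO WIDTH JOINER, woman, ZERO WIDTH JOINER, girl.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
print(len(family))  # 5

# Text limited to 3-byte UTF-8 cannot include astral plane characters.
print(fits_three_byte_utf8("caf\u00e9"), fits_three_byte_utf8(pile_of_poo))  # True False
```

Comparing strings only after normalizing both to the same form (NFC here) avoids the first mismatch, but as noted above it does not address language-specific sorting rules.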