Dark Corners of Unicode (2015)

13 days ago
  • #programming
  • #text-rendering
  • #unicode
  • Unicode is a complex system designed to represent all human languages, but it's often misunderstood, even by programmers.
  • Unicode includes characters beyond ASCII, such as emoji, and uses codepoints to represent them.
  • UTF-8 is a variable-width encoding (1 to 4 bytes per codepoint) that can represent every Unicode codepoint, whereas ASCII is limited to 128 characters.
  • A single user-perceived character can be made up of multiple codepoints, such as a base letter plus combining diacritical marks, or an emoji sequence.
  • Sorting and comparing text in Unicode is non-trivial due to language-specific rules and normalization issues.
  • Unicode normalization can decompose characters into a canonical sequence of codepoints, but this doesn't solve all problems, especially with non-Latin scripts.
  • Terminal and font rendering of Unicode characters, especially emoji and combining characters, can be inconsistent.
  • JavaScript and other languages have limitations in handling Unicode, especially with astral plane characters (codepoints above U+FFFF), which JavaScript strings store as two UTF-16 code units.
  • MySQL's 'utf8' encoding is limited to 3 bytes per character, so it cannot store astral plane characters, which need 4 bytes in UTF-8; 'utf8mb4' is the full encoding.
  • Emoji do not form a single Unicode block; they are scattered across various blocks and defined by usage.
  • Unicode includes many interesting and obscure characters, from control pictures to alchemical symbols.
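
Several of the points above (multi-codepoint characters, normalization, and the cost of astral plane characters in UTF-8 and UTF-16) can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `utf8_length`, `utf16_code_units`, and `same_text` are made up for illustration and do not come from the original article.

```python
import unicodedata


def utf8_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of the text."""
    # "utf-16-le" avoids the byte order mark that plain "utf-16" prepends.
    return len(text.encode("utf-16-le")) // 2


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE, one codepoint
    decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT, two codepoints

    print(len(composed), len(decomposed))    # 1 2
    print(composed == decomposed)            # False: different codepoint sequences
    print(same_text(composed, decomposed))   # True once both are normalized

    astral = "\U0001F4A9"   # a codepoint above U+FFFF
    print(len(astral))               # 1: Python 3 counts codepoints
    print(utf8_length(astral))       # 4: too wide for a 3-byte-per-character limit
    print(utf16_code_units(astral))  # 2: a surrogate pair in UTF-16
```

The last two lines show why a 3-byte limit such as MySQL's 'utf8' rejects this character, and why a language that counts UTF-16 code units reports a length of 2 for what looks like one character. Normalization only handles canonically equivalent sequences, so `same_text` is a starting point for comparison rather than a full answer to language-specific sorting.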