Hasty Briefsbeta

Bilingual

Unicode Footguns in Python

6 months ago
  • #Python
  • #Unicode
  • #Text Processing
  • Unicode characters can appear the same but have different underlying code points, known as canonical equivalence.
  • Python's unicodedata.normalize() function helps standardize strings for accurate comparison by converting them to NFC (composed form) or NFD (decomposed form).
  • String length in Python counts code points, not visual characters, requiring normalization or grapheme clustering for accurate visual representation.
  • Invisible characters like Zero Width Space (U+200B) can cause parsing issues and are revealed using repr() instead of print().
  • Homographic attacks exploit Unicode's visual similarity to deceive users, such as using Cyrillic letters that look like Latin ones in phishing domains.