Unicode Footguns in Python

6 months ago

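Two strings can render identically yet consist of different code point sequences. When Unicode defines such sequences as representing the same abstract character, they are canonically equivalent: for example, "é" can be the single code point U+00E9, or "e" (U+0065) followed by a combining acute accent (U+0301).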
Python's unicodedata.normalize() function helps standardize strings for accurate comparison by converting them to NFC (composed form) or NFD (decomposed form).
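
As a minimal sketch of the comparison problem and its fix, the snippet below builds "é" in both forms and compares them before and after normalization. The helper name nfc_equal is hypothetical and only for illustration; unicodedata.normalize() is the standard library call.

```python
import unicodedata

composed = "\u00e9"      # "é" as one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                                # False
print(nfc_equal(composed, decomposed))                       # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Either form works for comparison as long as both sides are normalized to the same one.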
len() in Python counts code points, not user-perceived characters (grapheme clusters). Normalizing to NFC collapses sequences that have a precomposed form, but counting what a reader actually sees requires grapheme cluster segmentation.
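
The following sketch illustrates why normalization alone is not enough for counting visible characters. The sample strings are chosen for illustration: a decomposed "é" and a family emoji built from three emoji joined by ZERO WIDTH JOINER.

```python
import unicodedata

decomposed = "e\u0301"                                 # renders as one "é"
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"  # renders as one family emoji

print(len(decomposed))                                 # 2
print(len(unicodedata.normalize("NFC", decomposed)))   # 1

# No precomposed form exists for the emoji sequence, so NFC changes nothing.
print(len(family))                                     # 5
print(len(unicodedata.normalize("NFC", family)))       # 5
```

The standard library has no grapheme cluster API. One option is the third-party regex package, whose \X pattern matches an extended grapheme cluster.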
Invisible characters such as ZERO WIDTH SPACE (U+200B) can break parsing and comparisons. print() renders them as nothing, while repr() shows them as escape sequences.
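
A short sketch of the difference, assuming a token that picked up a trailing zero width space (for instance from copy and paste). The helper strip_format_chars is hypothetical; it removes every character in the Unicode "Cf" (format) category. That is a blunt choice: it would also remove the joiners that hold emoji sequences together.

```python
import unicodedata

token = "admin\u200b"    # trailing ZERO WIDTH SPACE, invisible when printed

print(token)             # displays as: admin
print(repr(token))       # 'admin\u200b'
print(token == "admin")  # False


def strip_format_chars(text: str) -> str:
    """Hypothetical helper: drop all characters in category Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


print(strip_format_chars(token) == "admin")  # True
```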
Homograph attacks exploit visually similar characters from different scripts to deceive users, for example by using Cyrillic letters that look like Latin ones in phishing domains.
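
As a rough sketch of how a mixed-script string might be flagged, the snippet below uses the first word of each letter's Unicode name as a stand-in for its script. This is a heuristic assumption, not a real script property lookup, and scripts_used is a hypothetical helper. The spoofed string is an illustrative example that replaces the first two Latin letters with Cyrillic look-alikes.

```python
import unicodedata


def scripts_used(text: str) -> set:
    """Hypothetical heuristic: first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


genuine = "paypal"
spoofed = "\u0440\u0430ypal"   # CYRILLIC SMALL LETTER ER and A, then Latin "ypal"

print(genuine == spoofed)              # False
print(sorted(scripts_used(genuine)))   # ['LATIN']
print(sorted(scripts_used(spoofed)))   # ['CYRILLIC', 'LATIN']

# Normalization does not help here: the look-alikes are distinct letters.
print(unicodedata.normalize("NFKC", spoofed) == genuine)  # False
```

Note that normalization cannot catch this case, because the Cyrillic and Latin letters are not equivalent under any normalization form; only a script or confusables check can.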

Hasty Briefs (beta)