Unicode Footguns in Python
6 months ago
- #Python
- #Unicode
- #Text Processing
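- Strings can render identically yet consist of different code point sequences, such as "é" as the single code point U+00E9 versus "e" followed by the combining acute accent U+0301. Unicode calls such sequences canonically equivalent.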
- Python's unicodedata.normalize() function helps standardize strings for accurate comparison by converting them to NFC (composed form) or NFD (decomposed form).
- len() in Python counts code points, not user-perceived characters (grapheme clusters). NFC normalization closes the gap only where a precomposed form exists; counting what a reader actually sees requires grapheme cluster segmentation.
- Invisible characters like Zero Width Space (U+200B) can cause parsing issues. print() renders them as nothing, while repr() exposes them as escapes such as \u200b.
- Homograph attacks exploit visually similar characters from different scripts to deceive users, such as Cyrillic letters that look like Latin ones in phishing domains.
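
The points above can be reproduced with the standard library alone. The following is a minimal sketch, not a definitive implementation: the helper names `canonically_equal` and `scripts_used`, and the sample strings, are hypothetical and chosen only for illustration. In particular, `scripts_used` guesses a character's script from the first word of its Unicode name, which is a rough heuristic and an assumption of this sketch rather than a real script-detection API.

```python
import unicodedata

# Two spellings of "é": one precomposed code point vs. "e" + combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def scripts_used(text: str) -> set[str]:
    """Rough heuristic (an assumption, not a standard API): take the first word
    of each letter's Unicode name, e.g. 'LATIN' or 'CYRILLIC'."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


# Canonical equivalence: raw comparison fails, normalized comparison succeeds.
print(composed == decomposed)                   # False
print(canonically_equal(composed, decomposed))  # True

# len() counts code points, so the same visible character has two lengths.
print(len(composed), len(decomposed))           # 1 2

# Zero Width Space: invisible when printed, visible as an escape in repr().
token = "user\u200bname"
print(token)        # looks like "username" in most terminals
print(repr(token))  # 'user\u200bname'

# Homograph: Cyrillic "а" (U+0430) standing in for Latin "a" (U+0061).
latin = "paypal"
spoofed = "p\u0430yp\u0430l"
print(latin == spoofed)               # False
print(sorted(scripts_used(latin)))    # ['LATIN']
print(sorted(scripts_used(spoofed)))  # ['CYRILLIC', 'LATIN']
```

A mixed-script result such as the last line is only a warning sign, since legitimate text can mix scripts too. The standard library has no grapheme cluster segmentation; one option is the third-party `regex` module, whose `\X` pattern matches extended grapheme clusters.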