Unicode's confusables.txt and NFKC normalization disagree on 31 characters
6 hours ago
- #Normalization
- #Unicode
- #Security
- Homoglyph attacks exploit visually identical characters from different scripts, like Cyrillic 'а' vs. Latin 'a'.
- Unicode's confusables.txt provides a defense by mapping ~6,565 characters to their visual equivalents.
- NFKC normalization collapses compatibility variants (e.g., fullwidth letters → ASCII) and is recommended for slug validation.
- For 31 characters, confusables.txt and NFKC normalization disagree: each maps the same character to a different Latin letter or digit.
- Example: Long S (ſ, U+017F) is mapped to 'f' by confusables.txt (visual) but to 's' by NFKC (semantic).
- Capital I variants (16) are mapped to 'l' by confusables.txt but normalize to 'I' via NFKC ('i' once lowercased).
- Digit 0 variants (7) are mapped to 'o' by confusables.txt but normalize to '0' via NFKC.
- Digit 1 variants (7) are mapped to 'l' by confusables.txt but normalize to '1' via NFKC.
- The conflict arises because confusables.txt focuses on visual resemblance, while NFKC focuses on semantic meaning.
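- A naive pipeline combining both has a problem in either order: if NFKC runs first, the 31 conflicting entries can never match (dead code); if the confusable map runs first, those characters resolve to the wrong letter or digit.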
- Solution: Filter confusables.txt to exclude NFKC-handled characters, reducing entries from ~6,565 to ~613.
- Recommended pipeline: NFKC → NFKC-aware confusable map → mixed-script rejection.
- Documentation gap: Unicode standards don't address the interaction between confusables.txt and NFKC.
- A generator script automates creating an NFKC-aware confusable map, ensuring reproducibility across Unicode versions.
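
The points above can be illustrated with a minimal Python sketch. It is not the generator script mentioned in the summary: `SAMPLE_CONFUSABLES` is a hand-picked, hypothetical five-entry stand-in for confusables.txt (targets lowercased, following the description above), and the helper names `find_conflicts`, `filter_nfkc_aware`, `skeleton`, and `reversed_order` are made up for illustration. Mixed-script rejection is left out, since Python's standard `unicodedata` module does not expose script properties.

```python
import unicodedata

# Illustrative sample only: a few entries in the spirit of confusables.txt,
# with targets lowercased as described above. This is NOT the real data file.
SAMPLE_CONFUSABLES = {
    "\u017f": "f",      # LATIN SMALL LETTER LONG S
    "\U0001d408": "l",  # MATHEMATICAL BOLD CAPITAL I
    "\U0001d7ce": "o",  # MATHEMATICAL BOLD DIGIT ZERO
    "\U0001d7cf": "l",  # MATHEMATICAL BOLD DIGIT ONE
    "\u0430": "a",      # CYRILLIC SMALL LETTER A (left unchanged by NFKC)
}


def nfkc(text: str) -> str:
    """Apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


def find_conflicts(confusables: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Return entries that NFKC rewrites to something other than the visual target."""
    conflicts = {}
    for src, visual in confusables.items():
        semantic = nfkc(src).lower()
        if nfkc(src) != src and semantic != visual:
            conflicts[src] = (visual, semantic)
    return conflicts


def filter_nfkc_aware(confusables: dict[str, str]) -> dict[str, str]:
    """Drop entries whose source character NFKC already rewrites."""
    return {src: dst for src, dst in confusables.items() if nfkc(src) == src}


def skeleton(text: str, confusable_map: dict[str, str]) -> str:
    """Recommended order: NFKC and lowercase first, then the confusable map."""
    normalized = nfkc(text).lower()
    return "".join(confusable_map.get(ch, ch) for ch in normalized)


def reversed_order(text: str, confusable_map: dict[str, str]) -> str:
    """Confusable map first, then NFKC: the ordering the summary warns about."""
    mapped = "".join(confusable_map.get(ch, ch) for ch in text)
    return nfkc(mapped).lower()


aware_map = filter_nfkc_aware(SAMPLE_CONFUSABLES)
print(len(find_conflicts(SAMPLE_CONFUSABLES)))          # 4
print(skeleton("\u017fafe", aware_map))                  # safe
print(reversed_order("\u017fafe", SAMPLE_CONFUSABLES))   # fafe
```

In this sample, four of the five entries conflict with NFKC and are filtered out, leaving only the Cyrillic 'а' mapping. With NFKC first, "ſafe" resolves to "safe"; with the unfiltered map applied first, it becomes "fafe". A real implementation would parse the full confusables.txt for a pinned Unicode version rather than use a hard-coded dictionary.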