Unicode's confusables.txt and NFKC normalization disagree on 31 characters

6 hours ago

Homoglyph attacks exploit visually identical characters from different scripts, like Cyrillic 'а' vs. Latin 'a'.
Unicode's confusables.txt provides a defense by mapping ~6,565 characters to their visual equivalents.
NFKC normalization collapses compatibility variants (e.g., fullwidth letters → ASCII) and is recommended for slug validation.
For 31 characters, confusables.txt and NFKC normalization disagree: each maps the character to a different Latin letter or digit.
Example: Long S (ſ, U+017F) is mapped to 'f' by confusables.txt (visual) but to 's' by NFKC (semantic).
Capital I variants (16) are mapped to 'l' by confusables.txt but normalize to 'I' via NFKC ('i' once lowercased).
Digit 0 variants (7) are mapped to 'o' by confusables.txt but normalize to '0' via NFKC.
Digit 1 variants (7) are mapped to 'l' by confusables.txt but normalize to '1' via NFKC.
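
The four kinds of disagreement listed above can be checked with Python's standard `unicodedata` module. This is a minimal sketch, not the article's own code: the visual targets are copied from the summary rather than read from confusables.txt, and the three mathematical bold code points are assumed to be representative members of the "variants" groups.

```python
import unicodedata

# Visual targets as reported in the summary (hard-coded here, not parsed
# from confusables.txt). The specific code points are illustrative picks.
REPORTED_VISUAL_TARGET = {
    "\u017F": "f",      # LATIN SMALL LETTER LONG S
    "\U0001D408": "l",  # MATHEMATICAL BOLD CAPITAL I
    "\U0001D7CE": "o",  # MATHEMATICAL BOLD DIGIT ZERO
    "\U0001D7CF": "l",  # MATHEMATICAL BOLD DIGIT ONE
}


def nfkc_lower(ch: str) -> str:
    """NFKC-normalize, then lowercase, as a slug pipeline typically would."""
    return unicodedata.normalize("NFKC", ch).lower()


def disagreements(table):
    """Return {char: (visual_target, nfkc_result)} where the two differ."""
    return {
        ch: (visual, nfkc_lower(ch))
        for ch, visual in table.items()
        if nfkc_lower(ch) != visual
    }


if __name__ == "__main__":
    for ch, (visual, semantic) in disagreements(REPORTED_VISUAL_TARGET).items():
        print(f"U+{ord(ch):04X}: confusables -> {visual!r}, NFKC -> {semantic!r}")
```

Note that NFKC itself yields capital 'I' for the capital I variants; the lowercase 'i' only appears after the separate lowercasing step.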
The conflict arises because confusables.txt focuses on visual resemblance, while NFKC focuses on semantic meaning.
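A naive pipeline combining both goes wrong either way: if NFKC runs first, the 31 conflicting entries are dead code because those characters never reach the confusable map; if the stages are reversed, the map produces the wrong letter (e.g., ſ becomes 'f' instead of 's').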
Solution: Filter confusables.txt to exclude NFKC-handled characters, reducing entries from ~6,565 to ~613.
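
One way to sketch that filter in Python is shown below. It assumes "NFKC-handled" means "NFKC changes the character", so an entry is kept only when its source survives normalization unchanged. The function name and the sample entries are hypothetical.

```python
import unicodedata


def nfkc_aware(confusables):
    """Drop entries whose source character NFKC would rewrite anyway.

    Assumption: a character counts as "NFKC-handled" when normalizing it
    yields something other than the character itself.
    """
    return {
        src: dst
        for src, dst in confusables.items()
        if unicodedata.normalize("NFKC", src) == src
    }


# Hypothetical sample entries, in {source: visual target} form.
SAMPLE = {
    "\u0430": "a",      # CYRILLIC SMALL LETTER A: untouched by NFKC, kept
    "\u017F": "f",      # LONG S: NFKC gives 's', dropped
    "\uFF41": "a",      # FULLWIDTH LATIN SMALL LETTER A: NFKC gives 'a', dropped
    "\U0001D7CE": "o",  # MATHEMATICAL BOLD DIGIT ZERO: NFKC gives '0', dropped
}

if __name__ == "__main__":
    print(nfkc_aware(SAMPLE))
```

Filtering this way removes the conflicting entries as a side effect, since every one of them is a character that NFKC rewrites.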
Recommended pipeline: NFKC → NFKC-aware confusable map → mixed-script rejection.
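
The three stages might be wired together roughly as follows. This is a sketch under stated assumptions: the confusable map is a tiny hand-written stand-in, lowercasing is applied right after NFKC, the script of a character is guessed from the first word of its Unicode name (a crude stand-in for real script data), and the mixed-script check runs on the mapped output. All function names are hypothetical.

```python
import unicodedata

# Tiny stand-in for an NFKC-aware confusable map (Cyrillic а, е, о).
CONFUSABLE_MAP = {"\u0430": "a", "\u0435": "e", "\u043E": "o"}


def script_of(ch):
    """Crude script guess: first word of the Unicode name, letters only."""
    if not ch.isalpha():
        return None  # digits, hyphens, etc. carry no script here
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def is_mixed_script(text):
    scripts = {script_of(ch) for ch in text} - {None}
    return len(scripts) > 1


def canonical_slug(raw):
    # Stage 1: NFKC collapses compatibility variants; then lowercase.
    text = unicodedata.normalize("NFKC", raw).lower()
    # Stage 2: map the remaining confusables to their Latin look-alikes.
    mapped = "".join(CONFUSABLE_MAP.get(ch, ch) for ch in text)
    # Stage 3: reject anything that still mixes scripts.
    if is_mixed_script(mapped):
        raise ValueError(f"mixed-script slug: {raw!r}")
    return mapped


if __name__ == "__main__":
    print(canonical_slug("\uFF50\u0430ypal"))  # fullwidth p + Cyrillic а
    print(canonical_slug("\u017Fecure"))       # long s is handled by NFKC
```

Because NFKC runs first, the long s never reaches the confusable map, so the map does not need (and should not contain) an entry for it.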
Documentation gap: Unicode standards don't address the interaction between confusables.txt and NFKC.
A generator script automates creating an NFKC-aware confusable map, ensuring reproducibility across Unicode versions.
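
A generator along those lines could look like the sketch below. It is not the article's script. It assumes the semicolon-separated layout of confusables.txt (hex source; hex target sequence; type; `#` comment), keeps only entries whose target is a single ASCII letter or digit, lowercases targets, skips ASCII sources, and drops any source that NFKC rewrites. Those filtering choices and the function name are assumptions for illustration.

```python
import unicodedata

LATIN_TARGETS = set("abcdefghijklmnopqrstuvwxyz0123456789")


def build_nfkc_aware_map(lines):
    """Parse confusables.txt-style lines into {source char: Latin target}."""
    result = {}
    for line in lines:
        # Drop a leading BOM and the trailing "# ..." comment.
        data = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2:
            continue
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split()).lower()
        # Keep single-character sources that map to one Latin letter or digit.
        if len(source) != 1 or target not in LATIN_TARGETS:
            continue
        # Assumption: plain ASCII sources (e.g. '1' -> 'l') are left alone.
        if source.isascii():
            continue
        # Skip characters NFKC already rewrites; they never reach this map.
        if unicodedata.normalize("NFKC", source) != source:
            continue
        result[source] = target
    return result


SAMPLE_LINES = [
    "# sample lines in confusables.txt layout",
    "",
    "0441 ;\t0063 ;\tMA\t# CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C",
    "017F ;\t0066 ;\tMA\t# LATIN SMALL LETTER LONG S -> LATIN SMALL LETTER F",
    "1D7CE ;\t004F ;\tMA\t# MATHEMATICAL BOLD DIGIT ZERO -> LATIN CAPITAL LETTER O",
    "0031 ;\t006C ;\tMA\t# DIGIT ONE -> LATIN SMALL LETTER L",
]

if __name__ == "__main__":
    print(build_nfkc_aware_map(SAMPLE_LINES))
```

Running the same function over each new release of confusables.txt, with the `unicodedata` tables of the matching Unicode version, is what would make the map reproducible.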
