Unicode's confusables.txt and NFKC normalization disagree on 31 characters

6 hours ago
  • #Normalization
  • #Unicode
  • #Security
  • Homoglyph attacks exploit visually identical characters from different scripts, like Cyrillic 'а' vs. Latin 'a'.
  • Unicode's confusables.txt provides a defense by mapping ~6,565 characters to their visual equivalents.
  • NFKC normalization collapses compatibility variants (e.g., fullwidth letters → ASCII) and is recommended for slug validation.
  • 31 characters in confusables.txt conflict with NFKC normalization, mapping to different Latin letters/digits.
  • Example: the long s ('ſ', U+017F) is mapped to 'f' by confusables.txt (visual match) but to 's' by NFKC (semantic identity); see the first sketch after this list.
  • Capital I variants (16) are mapped to 'l' by confusables.txt but normalize to 'i' via NFKC.
  • Digit 0 variants (7) are mapped to 'o' by confusables.txt but normalize to '0' via NFKC.
  • Digit 1 variants (7) are mapped to 'l' by confusables.txt but normalize to '1' via NFKC.
  • The conflict arises because confusables.txt focuses on visual resemblance, while NFKC focuses on semantic meaning.
  • A naive pipeline combining both either carries dead code (the 31 conflicting entries can never match once NFKC has run) or produces incorrect results if the confusable mapping runs before NFKC.
  • Solution: Filter confusables.txt to exclude NFKC-handled characters, reducing entries from ~6,565 to ~613.
  • Recommended pipeline: NFKC → NFKC-aware confusable map → mixed-script rejection (sketched in Python after this list).
  • Documentation gap: Unicode standards don't address the interaction between confusables.txt and NFKC.
  • A generator script automates creating an NFKC-aware confusable map, ensuring reproducibility across Unicode versions; a minimal version is sketched below.
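
The disagreement is easy to reproduce with Python's standard library: `unicodedata.normalize("NFKC", ...)` gives the semantic result, while the visual targets ('f', 'l', 'o', 'l') are the ones confusables.txt assigns according to the article. A minimal check; apart from the long s, the specific characters below are illustrative picks of mine, not ones the article names:

```python
import unicodedata

# NFKC output is computed live; the confusables.txt side (in the comments)
# follows the article's description and should be re-checked against the
# file you actually ship.
samples = [
    "\u017F",  # LATIN SMALL LETTER LONG S        -- confusables.txt: 'f'
    "\u2160",  # ROMAN NUMERAL ONE                -- an example capital-I look-alike
    "\uFF2B",  # FULLWIDTH LATIN CAPITAL LETTER K -- plain compatibility variant
]
for ch in samples:
    nfkc = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: NFKC -> {nfkc!r}")
# U+017F LATIN SMALL LETTER LONG S: NFKC -> 's'
# U+2160 ROMAN NUMERAL ONE: NFKC -> 'I'
# U+FF2B FULLWIDTH LATIN CAPITAL LETTER K: NFKC -> 'K'
```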
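
Put together, the recommended pipeline might look like the sketch below, assuming Python's `unicodedata`, a pre-built `NFKC_AWARE_CONFUSABLES` dict (the two Cyrillic entries are placeholders for the generated table), and a crude script check based on character-name prefixes, since Python does not expose the Unicode Script property directly; names like `validate_slug` are mine, not the article's:

```python
import unicodedata

# Placeholder for the generated NFKC-aware map (see the generator sketch
# below); real entries come from the filtered confusables.txt.
NFKC_AWARE_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A -> Latin 'a'
    "\u043E": "o",  # CYRILLIC SMALL LETTER O -> Latin 'o'
}

def scripts_of(text: str) -> set[str]:
    # Crude script bucketing via the first word of each character's name;
    # good enough for a sketch, not a substitute for the Script property.
    return {unicodedata.name(c).split()[0] for c in text if c.isalpha()}

def validate_slug(raw: str) -> str:
    s = unicodedata.normalize("NFKC", raw).casefold()          # 1. NFKC (+ case folding)
    s = "".join(NFKC_AWARE_CONFUSABLES.get(c, c) for c in s)   # 2. NFKC-aware confusable map
    if len(scripts_of(s)) > 1:                                 # 3. mixed-script rejection
        raise ValueError("mixed-script slug rejected")
    return s

# Fullwidth 'P' is folded by NFKC, Cyrillic 'а' by the confusable map:
print(validate_slug("\uFF30ayp\u0430l"))  # -> 'paypal'
```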
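
The generator itself can be as small as the sketch below. It assumes the published confusables.txt field layout (`source ; target ; type  # comment`, with space-separated hex code points) and uses "source is unchanged by NFKC" as the filtering criterion; the article's script may apply further filters (for example restricting targets to ASCII), so treat this as the core idea rather than a faithful reimplementation:

```python
import sys
import unicodedata

def build_nfkc_aware_map(confusables_path: str) -> dict[str, str]:
    """Keep only confusables.txt entries that NFKC does not already handle."""
    mapping: dict[str, str] = {}
    with open(confusables_path, encoding="utf-8-sig") as f:
        for line in f:
            line = line.split("#", 1)[0].strip()   # drop comments and blanks
            if not line:
                continue
            fields = [part.strip() for part in line.split(";")]
            if len(fields) < 2:
                continue
            source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
            target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
            # If NFKC already rewrites the source, this entry would be dead
            # code after the NFKC stage of the pipeline -- skip it.
            if unicodedata.normalize("NFKC", source) != source:
                continue
            mapping[source] = unicodedata.normalize("NFKC", target)
    return mapping

if __name__ == "__main__":
    table = build_nfkc_aware_map(sys.argv[1])
    print(f"{len(table)} NFKC-aware confusable entries")
```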