Hasty Briefs (beta)

RFC 9839 and Bad Unicode

18 hours ago
  • #Text Encoding
  • #Unicode
  • #RFC 9839
  • Unicode text fields should use UTF-8 encoding but exclude problematic characters.
  • RFC 9839 identifies and categorizes problematic Unicode characters and defines three subsets for safer usage: Unicode Scalars, XML Characters, and Unicode Assignables.
  • Problematic characters include U+0000 (NUL), U+0089 (a C1 control code), unpaired surrogates (e.g., U+DEAD), and noncharacters (e.g., U+7FFFF).
  • PRECIS (RFC 8264) provides a comprehensive framework but is complex and is tied to specific Unicode versions.
  • RFC 9839 is simpler and more practical for excluding problematic characters in new protocols and data structures.
  • A Go library is available for validating text fields against RFC 9839 subsets.
  • The post compares which problematic characters each data format (CBOR, JSON, XML, etc.) excludes.
  • Acknowledgments highlight collaborative improvements to RFC 9839.
  • Individual submissions to the IETF are labor-intensive; working groups are recommended for standardization efforts.
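
To make the three subsets concrete, here is a minimal Go sketch of a membership check using only the standard library. It is not the API of the Go library mentioned above: the names `Subset`, `Scalars`, `XMLChars`, `Assignables`, `inSubset`, and `firstBad` are hypothetical, and the code point ranges reflect one reading of the RFC 9839 definitions, so they should be checked against the RFC's ABNF before being relied on.

```go
package main

import "unicode/utf8"

// Subset names one of the three RFC 9839 subsets (hypothetical type).
type Subset int

const (
	Scalars     Subset = iota // every code point except surrogates
	XMLChars                  // also drops C0 controls (except tab, LF, CR), U+FFFE, U+FFFF
	Assignables               // also drops DEL, C1 controls, and all noncharacters
)

// inSubset reports whether code point r belongs to subset s.
func inSubset(r rune, s Subset) bool {
	if r < 0 || r > 0x10FFFF || (r >= 0xD800 && r <= 0xDFFF) {
		return false // out of range or a surrogate: excluded everywhere
	}
	if s == Scalars {
		return true
	}
	// Both remaining subsets keep only tab, LF, and CR among the C0 controls.
	if r < 0x20 && r != 0x09 && r != 0x0A && r != 0x0D {
		return false
	}
	if s == XMLChars {
		return r != 0xFFFE && r != 0xFFFF
	}
	// Assignables: exclude DEL and the C1 controls.
	if r >= 0x7F && r <= 0x9F {
		return false
	}
	// Exclude noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
	if (r >= 0xFDD0 && r <= 0xFDEF) || r&0xFFFE == 0xFFFE {
		return false
	}
	return true
}

// firstBad returns the byte offset of the first problem in s, either
// invalid UTF-8 or a code point outside sub, or -1 if s is clean.
func firstBad(s string, sub Subset) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			return i // invalid UTF-8, which includes UTF-8-encoded surrogates
		}
		if !inSubset(r, sub) {
			return i
		}
		i += size
	}
	return -1
}
```

With these definitions, the four example code points from the summary behave differently per subset: U+0000 passes only Scalars, U+0089 and U+7FFFF pass Scalars and XMLChars but not Assignables, and U+DEAD passes none. Because Go's decoder rejects UTF-8-encoded surrogates, `firstBad` reports them as invalid bytes rather than as decoded code points.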