An Introduction to Writing Systems and Unicode
3 days ago
- #Unicode-Encoding
- #Chinese-Scripts
- #East-Asian-Languages
- Traditional Chinese was the original script, with Simplified Chinese introduced in the 1950s in Mainland China, simplifying character shapes and reducing the character set, often mapping multiple traditional characters to one simplified form.
- Traditional Chinese is used in Taiwan, Hong Kong, and the diaspora, while Simplified Chinese is used in Mainland China and Singapore; both scripts represent meaning rather than sound across diverse Chinese dialects.
- About 3,000 to 4,000 Chinese characters (hanzi) are needed for daily use, and Unicode encodes over 70,000; Japanese combines kanji (borrowed from Chinese, about 2,000 in common use), hiragana for native words and grammatical endings, and katakana for loanwords.
- Japanese kana scripts include features like dakuten for voiced consonants, small tsu for consonant lengthening, and small versions of characters for combined syllables; katakana uses a lengthening mark for vowels.
- Korean uses hangul, a script that is unique in writing individual phonemes (jamo) as letters arranged into syllable blocks; it can be mixed with hanja (Chinese characters), though text written in hangul alone is common, with about 2,300 characters in everyday use.
- Radicals are components used for indexing and building ideographs, with 214 Kangxi radicals recognized; Unicode has dedicated blocks for radicals and their variants, but these code points should not be used in place of the corresponding ideographs.
- A character set defines the characters needed for a script, a coded character set assigns each character a unique code point, and an encoding maps code points to the byte sequences a computer stores; early systems used code pages with limited space, so the same bytes could mean different characters under different code pages, which made localization difficult.
- Unicode provides a universal character set with over one million code points (1,114,112, from U+0000 to U+10FFFF) arranged in 17 planes, supporting all scripts simultaneously and easing localization without code page switches; it includes private use areas for custom mappings.
- Unicode encodings include UTF-8 (1 to 4 bytes per character), UTF-16 (2 or 4 bytes), and UTF-32 (always 4 bytes), with UTF-8 recommended for web pages; code must respect character boundaries, because cutting a multi-byte sequence in the middle truncates characters or produces garbled text.
- Input method editors (IMEs) help enter East Asian characters, using strategies such as phonetic transcription (e.g., romaji for Japanese, pinyin for Chinese) or visual composition (e.g., Changjie); tools like UniView and Unibook assist with Unicode character lookup and management.
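
The point about encodings and character boundaries can be made concrete with a short Python sketch. The helper names `encoded_lengths` and `truncate_utf8` are hypothetical, chosen for this illustration; only the standard `str.encode` and `bytes.decode` methods are used. The little-endian codec variants are chosen so that no byte order mark is added to the counts.

```python
def encoded_lengths(ch: str) -> dict[str, int]:
    """Return the number of bytes one character occupies in each encoding."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character.

    Slicing the bytes may leave an incomplete sequence at the end;
    errors="ignore" drops that partial character instead of raising.
    """
    return text.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # ASCII letter, a BMP ideograph (U+8A9E), a Plane 2 ideograph (U+20000)
    for ch in ("A", "\u8a9e", "\U00020000"):
        print(f"U+{ord(ch):04X}", encoded_lengths(ch))

    # Each of these three kanji takes 3 bytes in UTF-8, so a 4-byte
    # budget holds only the first one.
    print(truncate_utf8("\u65e5\u672c\u8a9e", 4))  # prints the single character U+65E5
```

Decoding the naively sliced bytes without `errors="ignore"` raises `UnicodeDecodeError`, which is the kind of boundary problem the bullet warns about. Note that this sketch only protects code point boundaries; it does not keep a base character together with any combining marks that follow it.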
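
The way hangul combines phonemes into syllable blocks is also visible in how Unicode lays out its precomposed syllables: each block's code point can be computed from the indexes of its leading consonant, vowel, and optional trailing consonant. The sketch below uses the arithmetic from the Unicode Standard's description of hangul composition; the function name `compose_hangul` is hypothetical, and the result is cross-checked against Python's `unicodedata.normalize`.

```python
import unicodedata

# Constants from the Unicode hangul syllable composition arithmetic
S_BASE = 0xAC00   # first precomposed syllable
L_BASE = 0x1100   # first leading consonant (choseong)
V_BASE = 0x1161   # first vowel (jungseong)
T_BASE = 0x11A7   # one before the first trailing consonant (jongseong)
V_COUNT = 21
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"


def compose_hangul(lead: str, vowel: str, tail: str = "") -> str:
    """Combine conjoining jamo into one precomposed syllable block."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(tail) - T_BASE if tail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


if __name__ == "__main__":
    # h (U+1112) + a (U+1161) + n (U+11AB) form the syllable "han" (U+D55C)
    jamo = "\u1112\u1161\u11ab"
    block = compose_hangul(*jamo)
    print(f"U+{ord(block):04X}")                          # U+D55C
    print(block == unicodedata.normalize("NFC", jamo))    # True
    print(unicodedata.normalize("NFD", block) == jamo)    # True
```

In practice, normalization (NFC to compose, NFD to decompose) is the simpler route; the arithmetic is shown only to illustrate the block structure.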
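
Two of the points above, dakuten in kana and the distinction between radicals and ideographs, can be inspected with Python's standard `unicodedata` module. This is a minimal sketch: the helper name `describe` is hypothetical, and the sample characters are chosen only for illustration.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return the code point and Unicode character name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Hiragana KA followed by the combining dakuten (voiced sound mark)
    ka_with_dakuten = "\u304b\u3099"
    ga = unicodedata.normalize("NFC", ka_with_dakuten)
    print(len(ka_with_dakuten), len(ga))  # 2 1
    print(describe(ga))                   # U+304C HIRAGANA LETTER GA

    # A Kangxi radical looks like an ideograph but is a separate character
    radical_one = "\u2f00"
    ideograph_one = "\u4e00"
    print(describe(radical_one))          # U+2F00 KANGXI RADICAL ONE
    print(describe(ideograph_one))        # U+4E00 CJK UNIFIED IDEOGRAPH-4E00
    print(radical_one == ideograph_one)   # False

    # Compatibility normalization maps the radical to the ideograph
    print(unicodedata.normalize("NFKC", radical_one) == ideograph_one)  # True
```

Because the radical and the ideograph compare unequal, text that uses a radical code point where an ideograph is meant will not match in searches or sorting unless it is normalized first, which is one practical reason not to use radicals as ideographs.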