What Every Programmer Positively Needs to Know About Encodings ...
3 days ago
- #encoding
- #programming
- #unicode
- Encodings are essential for handling text in computers, even for simple tasks like sending emails.
- ASCII is a basic encoding scheme using 7 bits per character, covering 128 characters including English letters, numbers, and some symbols.
- Extended single-byte encodings like ISO-8859-1 use all 8 bits (256 values) to add accented Western European characters, but still can't represent all languages.
- Multi-byte encodings like GB18030 and BIG-5 use more than one byte per character to support languages with thousands of characters, such as Chinese: BIG-5 uses one or two bytes per character, and GB18030 uses one, two, or four.
- Unicode is a universal standard that aims to cover all characters from all languages; its code space has room for 1,114,112 code points, of which only a fraction are currently assigned.
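- UTF-8, UTF-16, and UTF-32 are Unicode encoding schemes: UTF-32 uses a fixed 4 bytes per code point, UTF-16 uses 2 or 4, and UTF-8 uses 1 to 4. UTF-8 is backward compatible with ASCII and widely used because it stays compact for ASCII-heavy text.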
- Garbled text occurs when a byte sequence is interpreted with the wrong encoding, so the correct encoding must always be specified or reliably detected.
- PHP handles strings as byte sequences without native Unicode support, so byte-oriented string functions must be used with care to avoid splitting a multi-byte character in the middle.
- The Multibyte String extension in PHP provides functions that are aware of multi-byte characters, necessary for correct string manipulation in UTF-8.
- Best practices include using UTF-8 as the standard encoding, converting other encodings to UTF-8 upon input, and ensuring consistent encoding across systems.
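
Several of the points above (ASCII compatibility, byte widths, garbled text, broken multi-byte characters, and normalizing to UTF-8) can be seen in a few lines of code. The following is a minimal sketch in Python, chosen only for illustration since the summary itself discusses PHP; the sample strings and variable names are assumptions, not taken from the article. Slicing a `bytes` value here is meant as a rough analogue of what a byte-oriented string function does to UTF-8 text.

```python
# Illustrative sketch; sample strings and names are hypothetical.
plain = "hello"
accented = "héllo"   # contains é (U+00E9)
chinese = "中"        # U+4E2D

# ASCII text produces identical bytes in ASCII and UTF-8.
assert plain.encode("ascii") == plain.encode("utf-8")

# The same characters take a different number of bytes per encoding.
latin1_bytes = accented.encode("iso-8859-1")  # b'h\xe9llo'      (5 bytes)
utf8_bytes = accented.encode("utf-8")         # b'h\xc3\xa9llo'  (6 bytes)
utf16_bytes = accented.encode("utf-16-le")    # 2 bytes per character here
utf32_bytes = accented.encode("utf-32-le")    # always 4 bytes per character
print(len(latin1_bytes), len(utf8_bytes), len(utf16_bytes), len(utf32_bytes))
# prints: 5 6 10 20

# A Chinese character: 2 bytes in BIG-5 and GB18030, 3 bytes in UTF-8.
print(len(chinese.encode("big5")),
      len(chinese.encode("gb18030")),
      len(chinese.encode("utf-8")))
# prints: 2 2 3

# Garbled text: UTF-8 bytes interpreted as ISO-8859-1.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)  # prints: hÃ©llo

# Cutting bytes without regard to character boundaries splits é in half,
# leaving a sequence that is no longer valid UTF-8.
truncated = utf8_bytes[:2]  # b'h\xc3'
try:
    truncated.decode("utf-8")
    truncated_is_valid = True
except UnicodeDecodeError:
    truncated_is_valid = False
print(truncated_is_valid)  # prints: False

# Normalizing on input: decode with the known source encoding,
# then re-encode as UTF-8.
normalized = latin1_bytes.decode("iso-8859-1").encode("utf-8")
assert normalized == utf8_bytes
```

The garbling step is reversible only because ISO-8859-1 maps every byte to some character; with other encoding pairs the original bytes may not be recoverable, which is one more reason to record the encoding alongside the data.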