Character Sets and Encodings
Definitions
Character Set: A collection of characters used in a language, and/or
symbols used in a symbolic system. Examples of character sets: numeric digits,
alphabetical letters, and Chinese characters.
Coded Character Set: A character set in which each character has an
assigned integral number. Examples of coded character sets: US-ASCII, EBCDIC,
ISO-8859-1, GB2312-1980, and Unicode. Note that:
- If character set B is a superset of character set A, and B assigns the same
numbers to the characters it shares with A, we say B is backward compatible
with A.
- Since we are only interested in coded character sets, from now on I will
use the term "character set" to mean "coded character set".
Code Point: An integral number assigned to a character in a coded character set.
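As a small illustration, the sketch below uses Python's built-in ord() to look
up the Unicode code point of a character. It also checks, byte by byte, that
ISO-8859-1 assigns the same characters as US-ASCII to the codes 0 through 127,
which is the backward compatibility described above. The helper names
code_point and agrees_on_range are my own, not part of any library.

```python
def code_point(ch: str) -> int:
    """Return the Unicode code point of a single character."""
    return ord(ch)


def agrees_on_range(codec_b: str, codec_a: str, codes: range) -> bool:
    """Check that two single-byte codecs map each code to the same character."""
    return all(
        bytes([code]).decode(codec_b) == bytes([code]).decode(codec_a)
        for code in codes
    )


print(hex(code_point("A")))        # 0x41
print(hex(code_point("\u4e2d")))   # 0x4e2d (a Chinese character)
print(agrees_on_range("iso-8859-1", "ascii", range(128)))  # True
```

The check only covers the 128 codes that US-ASCII defines. The upper half of
ISO-8859-1 has no counterpart in US-ASCII, so it is outside the comparison.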
Character Encoding: A mapping scheme between code points of a coded character
set and sequences of bytes. Note that:
- One coded character set may have many character encodings.
- One coded character set must have at least one character encoding.
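To make the first note concrete, here is a minimal sketch that encodes a single
Unicode code point, U+4E2D, with three different encodings of the same coded
character set. It assumes Python's standard codec names "utf-8", "utf-16-be",
and "utf-16-le"; the helper encode_all is only illustrative.

```python
def encode_all(text: str, codecs: list) -> dict:
    """Encode the same text with each codec and return the byte sequences."""
    return {name: text.encode(name) for name in codecs}


ch = "\u4e2d"  # one code point: U+4E2D
encodings = encode_all(ch, ["utf-8", "utf-16-be", "utf-16-le"])

for name, data in encodings.items():
    print(name, data.hex())
# utf-8 e4b8ad
# utf-16-be 4e2d
# utf-16-le 2d4e

# Every encoding maps its own bytes back to the same code point.
for name, data in encodings.items():
    assert data.decode(name) == ch
```

The code point stays the same in all three cases. Only the byte sequence
changes, which is why a byte stream cannot be interpreted without knowing its
encoding.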
Commonly Used Character Sets and Encodings
The following table summarizes some commonly used character sets and encodings:
Character Set    Encoding       # of Bytes   Byte Type   Language
-------------    --------       ----------   ---------   --------
ASCII            ASCII          1            7-bit       English
Latin1           ISO-8859-1     1            8-bit       Latin languages
GB2312-1980      GB             1-2          8-bit       Chinese
GB2312-1980      EUC-CN         1-2          8-bit       Chinese
GB2312-1980      HZ             1-2          7-bit       Chinese
GBK              GBK            1-2          8-bit       Chinese
GB18030-2000     GB18030-2000   1-4          8-bit       Chinese
Big5             Big5           1-2          8-bit       Chinese
CNS 11643-1992   EUC-TW         1-4          8-bit       Chinese
JIS              EUC-JP         1-2          8-bit       Japanese
JIS              ISO-2022-JP    1-2          7-bit       Japanese
JIS              Shift-JIS      1-2          8-bit       Japanese
KS               EUC-KR         1-2          8-bit       Korean
KS               ISO-2022-KR    1-2          7-bit       Korean
Unicode 3.0      UTF-7          1-3          7-bit       Multilingual
Unicode 3.0      UTF-8          1-3          8-bit       Multilingual
Unicode 3.0      UTF-16BE       2            8-bit       Multilingual
Unicode 3.0      UTF-16LE       2            8-bit       Multilingual
Unicode 3.1      UTF-8          1-4          8-bit       Multilingual
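Some rows of the table can be spot-checked with a short script. The sketch
below encodes a sample character and reports how many bytes were produced and
whether any byte has its high bit set (8-bit) or not (7-bit). It assumes the
codec names used by Python's standard library ("gb2312", "gbk", "gb18030",
"big5", "hz", "utf-8"), which do not match the names in the table exactly. I
am also assuming that Python's "gb2312" codec corresponds to the EUC-CN row.
The helper describe is my own.

```python
def describe(text: str, codec: str) -> tuple:
    """Return (number of bytes, True if any byte uses the high bit)."""
    data = text.encode(codec)
    uses_high_bit = any(b >= 0x80 for b in data)
    return len(data), uses_high_bit


sample = "\u4e2d"  # a Chinese character, U+4E2D

for codec in ["gb2312", "gbk", "gb18030", "big5", "hz", "utf-8"]:
    length, uses_high_bit = describe(sample, codec)
    byte_type = "8-bit" if uses_high_bit else "7-bit"
    print(f"{codec:10} {length} byte(s) {byte_type}")

# An ASCII letter takes a single byte in these multi-byte encodings.
print(describe("A", "gb2312"))  # (1, False)
```

Note that for stateful encodings such as HZ the measured length includes the
escape sequences that switch into and out of two-byte mode, so it can be larger
than the per-character count listed in the table.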