Herong's Tutorial Notes on Unicode - Character Sets and Encodings

Dr. Herong Yang, Version 4.02

Character Sets and Encodings

Definitions

Character Set: A collection of characters used in a language, and/or symbols used in a symbolic system. Examples of character sets: numeric digits, alphabetic letters, and Chinese characters.

Coded Character Set: A character set in which each character is assigned an integer. Examples of coded character sets: US-ASCII, EBCDIC, ISO-8859-1, GB2312-1980, and Unicode. Note that:

If character set B is a superset of character set A, and B assigns the same numbers to the characters of A, we say B is backward compatible with A.
Since we are only interested in coded character sets, from now on I will use the term "character set" to mean "coded character set".
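
The superset relationship can be checked for a few well-known cases. The sketch below is only an illustration, not part of the original notes: it assumes Python 3 and its built-in "ascii" and "latin-1" codecs, and the variable names are made up for this example. It relies on the fact that a Python string is a sequence of Unicode code points.

```python
# Every ASCII code (0-127) should name the same character in Latin1 and in Unicode.
ascii_bytes = bytes(range(128))
ascii_text = ascii_bytes.decode("ascii")
latin1_text = ascii_bytes.decode("latin-1")

# chr(i) is the Unicode character whose code point is i.
unicode_text = "".join(chr(i) for i in range(128))

ascii_in_latin1 = ascii_text == latin1_text
ascii_in_unicode = ascii_text == unicode_text

# Latin1 codes 0-255 should also match the first 256 Unicode code points.
latin1_in_unicode = (
    bytes(range(256)).decode("latin-1") == "".join(chr(i) for i in range(256))
)

print(ascii_in_latin1, ascii_in_unicode, latin1_in_unicode)
```

If all three values are true, Latin1 and Unicode are both backward compatible with ASCII, and Unicode is backward compatible with Latin1, in the sense defined above.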

Code Point: An integer assigned to a character in a coded character set.
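
As a small illustration of this definition, Python's built-in ord() returns the Unicode code point of a character. The sample characters and the dictionary name code_points are chosen for this sketch only.

```python
# Look up the Unicode code point of a few sample characters.
code_points = {ch: ord(ch) for ch in ["A", "é", "中"]}

for ch, cp in code_points.items():
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"{ch} -> U+{cp:04X} (decimal {cp})")

# chr() goes the other way: from a code point back to the character.
round_trip = all(chr(cp) == ch for ch, cp in code_points.items())
print(round_trip)
```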

Character Encoding: A mapping scheme between the code points of a coded character set and sequences of bytes. Note that:

One coded character set may have many character encodings.
One coded character set must have at least one character encoding.
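
To make the first note concrete, here is a minimal sketch, assuming Python 3 and its built-in codecs, that encodes one Unicode code point with three different Unicode encodings. The names text and encoded are illustrative only.

```python
# One code point, U+4E2D, in the Unicode coded character set.
text = "中"

# Three different character encodings of that same code point.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-be", "utf-16-le")}

for name, data in encoded.items():
    print(f"{name:10} {len(data)} bytes: {data.hex()}")

# Decoding with the matching encoding recovers the original character.
all_round_trip = all(data.decode(name) == text for name, data in encoded.items())
print(all_round_trip)
```

The byte sequences differ, but each one maps back to the same code point when decoded with the encoding that produced it.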

Commonly Used Character Sets and Encodings

The following table summarizes some commonly used character sets and encodings:

Character      Encoding       # of    Byte    Language
Set                           Bytes   Type

ASCII          ASCII          1       7-bit   English
Latin1         ISO-8859-1     1       8-bit   Western European languages
GB2312-1980    GB             1-2     8-bit   Chinese
GB2312-1980    EUC-CN         1-2     8-bit   Chinese
GB2312-1980    HZ             1-2     7-bit   Chinese
GBK            GBK            1-2     8-bit   Chinese
GB18030-2000   GB18030-2000   1-4     8-bit   Chinese
Big5           Big5           1-2     8-bit   Chinese
CNS 11643-1992 EUC-TW         1-4     8-bit   Chinese
JIS            EUC-JP         1-2     8-bit   Japanese
JIS            ISO-2022-JP    1-2     7-bit   Japanese
JIS            Shift-JIS      1-2     8-bit   Japanese
KS             EUC-KR         1-2     8-bit   Korean
KS             ISO-2022-KR    1-2     7-bit   Korean
Unicode 3.0    UTF-7          1-3     7-bit   Multilingual
Unicode 3.0    UTF-8          1-3     8-bit   Multilingual
Unicode 3.0    UTF-16BE       2       8-bit   Multilingual
Unicode 3.0    UTF-16LE       2       8-bit   Multilingual
Unicode 3.1    UTF-8          1-4     8-bit   Multilingual
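
Some of the table entries can be spot-checked with a short script. The sketch below is not part of the original notes. It assumes Python 3, whose standard library ships codecs for most of the encodings listed (EUC-TW is not among them, so it is skipped), and the sample characters are my own choice. Note that for stateful encodings such as HZ, ISO-2022-JP, ISO-2022-KR, and UTF-7, the reported length includes shift or escape sequences, so it is larger than the per-character size given in the table.

```python
# (Python codec name, sample character) pairs; the samples are arbitrary choices.
samples = [
    ("ascii", "A"), ("latin-1", "é"),
    ("gb2312", "中"), ("hz", "中"), ("gbk", "中"), ("gb18030", "中"),
    ("big5", "中"),
    ("euc_jp", "中"), ("iso2022_jp", "中"), ("shift_jis", "中"),
    ("euc_kr", "한"), ("iso2022_kr", "한"),
    ("utf-7", "中"), ("utf-8", "中"), ("utf-16-be", "中"), ("utf-16-le", "中"),
]

# Number of bytes produced for one sample character under each encoding.
lengths = {codec: len(ch.encode(codec)) for codec, ch in samples}

# Encodings the table lists as 7-bit should never set the high bit of a byte.
seven_bit_codecs = ("ascii", "hz", "iso2022_jp", "iso2022_kr", "utf-7")
seven_bit_clean = {
    codec: all(b < 0x80 for b in ch.encode(codec))
    for codec, ch in samples
    if codec in seven_bit_codecs
}

for codec, n in lengths.items():
    note = " (7-bit clean)" if seven_bit_clean.get(codec) else ""
    print(f"{codec:12} {n} byte(s){note}")
```

A single sample per encoding only confirms one point in each range; checking the full "1-2" or "1-4" ranges would need samples from each part of the character set.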

Dr. Herong Yang, updated in 2007