Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

Character Sets and Encodings

Definitions

Character Set: A collection of characters used in a language, and/or symbols used in a symbolic system. Examples of character sets: numeric digits, alphabetical letters, and Chinese characters.

Coded Character Set: A character set in which each character has an assigned integral number. Examples of coded character sets: US-ASCII, EBCDIC, ISO-8859-1, GB2312-1980, and Unicode. Note that:

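  • If character set B is a superset of character set A, we say B is backward compatible with A. For example, Unicode is backward compatible with ISO-8859-1, which in turn is backward compatible with US-ASCII.
  • Since we are only interested in coded character sets, from now on I will use the term "character set" to mean "coded character set".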
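
The backward compatibility described above can be checked with a small Python sketch. It assumes that Python's built-in "ascii" and "latin-1" codecs are acceptable stand-ins for US-ASCII and ISO-8859-1, and that Python's `ord()` returns the Unicode number of a character. The variable names are only illustrative.

```python
# Assumption: the built-in "ascii" and "latin-1" codecs stand in for the
# US-ASCII and ISO-8859-1 coded character sets.
ascii_bytes = bytes(range(128))  # every US-ASCII code, 0x00 through 0x7F

as_ascii = ascii_bytes.decode("ascii")
as_latin1 = ascii_bytes.decode("latin-1")

# Each US-ASCII character keeps the same number in the larger sets.
same_in_latin1 = as_ascii == as_latin1
same_in_unicode = [ord(c) for c in as_ascii] == list(range(128))

print("US-ASCII is a subset of ISO-8859-1:", same_in_latin1)
print("US-ASCII is a subset of Unicode:", same_in_unicode)
```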

Code Point: An integral number assigned to a character in a coded character set.
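
As a rough illustration of code points, the following Python sketch looks up the Unicode code point of three sample characters with the built-in `ord()` and converts the numbers back with `chr()`. The sample characters are my own choice, not part of the definition.

```python
# Sample characters (an arbitrary choice): a Latin letter, an accented
# letter, and a Chinese character, written as escapes to stay ASCII-safe.
samples = ["A", "\u00e9", "\u4e2d"]

# ord() returns the Unicode code point of a one-character string.
code_points = [ord(ch) for ch in samples]

for ch, cp in zip(samples, code_points):
    print(f"{ch!a}: decimal {cp}, hexadecimal U+{cp:04X}")

# chr() goes the other way, from code point back to character.
restored = [chr(cp) for cp in code_points]
```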

Character Encoding: A mapping scheme between the code points of a coded character set and sequences of bytes. Note that:

  • One coded character set may have many character encodings.
  • One coded character set must have at least one character encoding.
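
To make the "one character set, many encodings" point concrete, here is a minimal Python sketch that encodes a single Unicode code point, U+4E2D, with three different encodings. The codec names "utf-8", "utf-16-be" and "utf-16-le" are Python's names for these encodings. The choice of character is arbitrary.

```python
ch = "\u4e2d"  # one character, one code point: U+4E2D

# Encode the same code point with three encodings of Unicode.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-16-le")}

for name, data in encoded.items():
    print(f"{name:10} {data.hex()}  ({len(data)} bytes)")

# Decoding with the matching encoding recovers the original character.
round_trips = all(data.decode(name) == ch for name, data in encoded.items())
print("All encodings round-trip:", round_trips)
```

The byte sequences differ, but each one maps back to the same code point when decoded with the encoding that produced it.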

Commonly Used Character Sets and Encodings

The following table summarizes some commonly used character sets and encodings:

Character      Encoding       # of    Byte    Language
Set                           Bytes   Type  
ASCII          ASCII          1       7-bit   English
Latin1         ISO-8859-1     1       8-bit   Latin languages
GB2312-1980    GB             1-2     8-bit   Chinese
GB2312-1980    EUC-CN         1-2     8-bit   Chinese
GB2312-1980    HZ             1-2     7-bit   Chinese
GBK            GBK            1-2     8-bit   Chinese
GB18030-2000   GB18030-2000   1-4     8-bit   Chinese
Big5           Big5           1-2     8-bit   Chinese
CNS 11643-1992 EUC-TW         1-4     8-bit   Chinese
JIS            EUC-JP         1-2     8-bit   Japanese
JIS            ISO-2022-JP    1-2     7-bit   Japanese
JIS            Shift-JIS      1-2     8-bit   Japanese
KS             EUC-KR         1-2     8-bit   Korean
KS             ISO-2022-KR    1-2     7-bit   Korean
Unicode 3.0    UTF-7          1-3     7-bit   Multilingual
Unicode 3.0    UTF-8          1-3     8-bit   Multilingual
Unicode 3.0    UTF-16BE       2       8-bit   Multilingual
Unicode 3.0    UTF-16LE       2       8-bit   Multilingual
Unicode 3.1    UTF-8          1-4     8-bit   Multilingual
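
Some rows of the table can be spot-checked with a short Python sketch. It assumes Python's codec names map to the table entries as follows: "gb2312" for EUC-CN, "hz" for HZ, "big5" for Big5, and "latin-1" for ISO-8859-1. The sample characters are my own picks, and the helper `byte_count` is hypothetical.

```python
def byte_count(text, codec):
    """Return how many bytes `codec` uses to encode `text`."""
    return len(text.encode(codec))

# (table label, Python codec name, sample character)
rows = [
    ("ASCII", "ascii", "A"),
    ("ISO-8859-1", "latin-1", "\u00e9"),
    ("EUC-CN", "gb2312", "\u4e2d"),
    ("Big5", "big5", "\u4e2d"),
    ("UTF-8", "utf-8", "A"),
    ("UTF-8", "utf-8", "\u4e2d"),
    ("UTF-8", "utf-8", "\U00020000"),  # outside the 16-bit range
    ("UTF-16BE", "utf-16-be", "\u4e2d"),
]

counts = [byte_count(ch, codec) for _, codec, ch in rows]
for (label, _, ch), n in zip(rows, counts):
    print(f"{label:12} {ch!a:14} {n} byte(s)")

# "Byte Type": a 7-bit encoding never sets the high bit of any byte.
euc_is_7bit = all(b < 0x80 for b in "\u4e2d".encode("gb2312"))
hz_is_7bit = all(b < 0x80 for b in "\u4e2d".encode("hz"))
print("EUC-CN 7-bit:", euc_is_7bit, "| HZ 7-bit:", hz_is_7bit)
```

Note that the 4-byte UTF-8 case only arises for code points above U+FFFF, which matches the Unicode 3.1 row.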
Dr. Herong Yang, updated in 2007