Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

Unicode Character Set

What Is Unicode

Unicode is a character encoding standard for representing text in computer processing. It is fully compatible with the ISO/IEC 10646 standard.

The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world and special symbols.

Glyph: A visual representation of one fundamental element of written languages or symbols printed on paper or screen. For example, the letter "Z" is a glyph; and the letter "a" with "^" on top is another glyph.

A glyph describes the shape and form of a written element. Glyphs are font dependent: the same letter can be displayed with different glyphs if different fonts are used.

Code Element: A digital representation of one fundamental semantic value of written languages and symbols. For example, the letter "Z", no matter how many different shapes and forms it takes when printed on paper or screen, has only one semantic value, the capital letter "Z", so it is one code element. Another example is the letter "a" with "^" on top. Semantically, it may be represented by two values: the small letter "a" and the circumflex accent "^", so two code elements.

Character: The same as a code element in the Unicode context. In other contexts, a character might be a larger unit. For example, the letter "a" with "^" on top is called one character in many non-Unicode contexts, but it can be represented by two code elements, or two characters, in the Unicode context.
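The difference between one displayed letter and two code elements can be checked with a short Python sketch. It assumes Python 3 and its standard unicodedata module; the variable names are only illustrative.

```python
import unicodedata

# "a" followed by COMBINING CIRCUMFLEX ACCENT: shown as one letter on screen
decomposed = "\u0061\u0302"

# len() counts code elements (code points), not displayed glyphs
code_element_count = len(decomposed)

# combining() returns 0 for a base letter and a non-zero class for a combining mark
combining_classes = [unicodedata.combining(ch) for ch in decomposed]

print(code_element_count)   # 2
print(combining_classes)    # [0, 230]
```

The non-zero combining class on the second element is what tells a renderer to draw the accent on top of the preceding letter instead of as a separate glyph.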

Code Point: A number assigned to a code element, usually written in hexadecimal with the prefix "U+". For example, the code point of the code element for the Latin letter "Z" is U+005A, and the code point of the code element for the combining circumflex accent is U+0302.
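As a minimal sketch of the "U+" notation, the two helper functions below convert between a character and its code point string. The function names format_code_point and parse_code_point are hypothetical and were chosen for this example; only the built-in ord() and chr() come from Python itself.

```python
def format_code_point(ch: str) -> str:
    """Return the U+XXXX notation for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; pad to at least 4 hex digits
    return f"U+{ord(ch):04X}"


def parse_code_point(notation: str) -> str:
    """Return the character for a string such as 'U+005A' (hypothetical helper)."""
    # Drop the "U+" prefix and read the rest as a hexadecimal number
    return chr(int(notation[2:], 16))


print(format_code_point("Z"))             # U+005A
print(parse_code_point("U+005A"))         # Z
print(format_code_point("\u0302"))        # U+0302
```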

Code Name: A name assigned to a code element. The code names are compatible with the character names defined in ISO/IEC 10646.

Unicode: A database of code points and code names assigned to code elements in all written languages and symbols.

Equivalent Sequences: Sequences of code elements that represent the same semantic value. For example, the small letter "a" with the circumflex accent "^" can be represented by a single code element: U+00E2 (LATIN SMALL LETTER A WITH CIRCUMFLEX). It can also be represented by a sequence of two code elements: U+0061 (LATIN SMALL LETTER A) and U+0302 (COMBINING CIRCUMFLEX ACCENT). So U+00E2 and U+0061 U+0302 are equivalent sequences.
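One way to see this equivalence in practice is Unicode normalization, which these notes do not cover in detail. The sketch below assumes Python 3 and uses unicodedata.normalize with the standard "NFC" (composed) and "NFD" (decomposed) forms; the variable names are illustrative.

```python
import unicodedata

precomposed = "\u00E2"        # LATIN SMALL LETTER A WITH CIRCUMFLEX
decomposed = "\u0061\u0302"   # LATIN SMALL LETTER A + COMBINING CIRCUMFLEX ACCENT

# A plain comparison works on code points, so the two sequences differ
raw_equal = precomposed == decomposed

# NFC composes the sequence into the single code element
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# NFD decomposes the single code element into the two-element sequence
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed

print(raw_equal, nfc_equal, nfd_equal)   # False True True
```

Programs that compare or search text usually normalize both sides to the same form first, so that equivalent sequences match.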

Examples of code names:

Code   Code                        Code
Point  Name                        Element
U+005A LATIN CAPITAL LETTER Z      Capital letter "Z" in Latin
U+0061 LATIN SMALL LETTER A        Small letter "a" in Latin
U+0302 COMBINING CIRCUMFLEX ACCENT Circumflex accent on top of other 
                                   letters
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
                                   Small letter "a" with "^" on top
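The code names for these four code points can also be looked up programmatically. This is a small sketch assuming Python 3, whose unicodedata module ships a copy of the Unicode character database; the list name rows is only illustrative.

```python
import unicodedata

# The four code points from the examples
code_points = (0x005A, 0x0061, 0x0302, 0x00E2)

# Build (code point, code name) pairs from the character database
rows = [(f"U+{cp:04X}", unicodedata.name(chr(cp))) for cp in code_points]

for code_point, code_name in rows:
    print(code_point, code_name)

# lookup() goes the other way: from a code name to the character
z = unicodedata.lookup("LATIN CAPITAL LETTER Z")
print(z)   # Z
```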

Compared with other character coding standards, Unicode has the following unique features:

  • Full 16-bit coding. Each code is a 16-bit number, with no restrictions. None of the 16 bits is reserved for any special purpose.
  • Big enough to handle all existing written languages and symbols. 16 bits give 65,536 code values. The range can be extended with paired 16-bit codes to cover over one million additional code values.
  • Characters in the same language are coded in groups and ordered according to their natural sequence whenever possible.
  • No escape sequences. No shift states.
  • Common characters (letters) shared across languages are unified into a single code element. The biggest example is the unification of Chinese/Japanese/Korean (CJK) ideographs into one common set of code elements.
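The paired 16-bit codes mentioned in the list above are known as surrogate pairs in UTF-16, a term these notes do not use here. The sketch below shows the standard arithmetic under that assumption; the function name to_surrogate_pair and the sample code point U+1F600 are chosen for illustration only.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit codes (hypothetical helper)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a pair of 16-bit codes")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the first code
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the second code
    return high, low


# Each half carries 10 bits, so the pairs add 1024 * 1024 code values
extra_code_values = 1024 * 1024

high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")           # D83D DE00
print(extra_code_values)                 # 1048576

# Cross-check against Python's own UTF-16 encoder (big-endian, no byte order mark)
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded.hex().upper())             # D83DDE00
```

The ranges D800 to DBFF and DC00 to DFFF are set aside for these pairs, which is why a single 16-bit code in those ranges never stands for a character on its own.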
Dr. Herong Yang, updated in 2007