Unicode Character Set
What Is Unicode
Unicode is a character encoding standard for representing text in computer
processing. It is fully compatible with the ISO/IEC 10646 standard.
The Unicode Standard provides the capacity to encode all of the characters
used for the written languages of the world, as well as special symbols.
Glyph: A visual representation of one fundamental element of written
languages or symbols, as printed on paper or displayed on screen. For example,
the letter "Z" is a glyph, and the letter "a" with "^" on top is another glyph.
A glyph is a unit of shape and form in a written language.
Glyphs are font dependent: the same letter can be displayed with different
glyphs if different fonts are used.
Code Element: A digital representation of one fundamental semantic value
of written languages and symbols. For example, the letter "Z", no matter how
many different shapes and forms it takes when printed on paper or screen, has
only one semantic value, the capital letter "Z", and therefore one code element.
Another example is the letter "a" with "^" on top. Semantically, it may be
represented by two values, the small letter "a" and the circumflex accent "^",
and therefore by two code elements.
Character: Same as code element in the Unicode context. In other contexts,
a character might be a larger unit. For example, the letter "a" with "^"
on top is called one character in many non-Unicode contexts, but it may be
represented by two code elements, or two characters, in the Unicode context.
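The difference between what is displayed and what is stored can be observed
in Python, where a str is a sequence of Unicode code elements. The following is
a minimal sketch; the variable names are illustrative only.

```python
# The small letter "a" followed by the combining circumflex accent,
# written here with the Python escape \u0302.
a_circumflex = "a\u0302"

# len() on a Python str counts code elements, not displayed glyphs.
element_count = len(a_circumflex)
print(element_count)

# Splitting the string yields the two separate code elements.
elements = list(a_circumflex)
print(elements[0])
```

On most displays the string appears as a single glyph, while the count
reported by len() reflects the two code elements behind it.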
Code Point: A number assigned to a code element, usually written in
hexadecimal form with the prefix "U+". For example, the code point of the code
element for the Latin letter "Z" is U+005A, and the code point of the
code element for the combining circumflex accent is U+0302.
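As an illustration, Python's built-in ord() and chr() convert between a
character and its code point. The helper name code_point_label below is
hypothetical, introduced only for this sketch.

```python
def code_point_label(ch: str) -> str:
    """Return the code point of a single character in U+XXXX notation.

    Hypothetical helper for illustration; not part of any library.
    """
    return f"U+{ord(ch):04X}"


z_label = code_point_label("Z")
accent_label = code_point_label("\u0302")  # combining circumflex accent
print(z_label, accent_label)

# chr() goes the other way: from the number back to the character.
z_char = chr(0x005A)
print(z_char)
```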
Code Name: A name assigned to a code element. The code names are compatible
with the character names defined in ISO/IEC 10646.
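Code names can be queried programmatically. The sketch below uses Python's
standard unicodedata module, which is one possible tool and is not mentioned
in the text above.

```python
import unicodedata

# Look up the code name of a code element.
z_name = unicodedata.name("Z")
accent_name = unicodedata.name("\u0302")
print(z_name)
print(accent_name)

# Look up a code element by its code name.
looked_up = unicodedata.lookup("LATIN SMALL LETTER A WITH CIRCUMFLEX")
print(f"U+{ord(looked_up):04X}")
```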
Unicode: A database of code points and code names assigned to code elements
in all written languages and symbols.
Equivalent Sequences: Sequences of code elements that represent the same
semantic value. For example, the small letter "a" with the circumflex accent "^"
can be represented by a single code element: U+00E2 (LATIN SMALL LETTER A WITH
CIRCUMFLEX). It can also be represented by a sequence of two code elements:
U+0061 (LATIN SMALL LETTER A) and U+0302 (COMBINING CIRCUMFLEX ACCENT). So U+00E2
and U+0061 U+0302 are equivalent sequences.
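Equivalent sequences are not equal when compared code element by code element,
so programs usually normalize text before comparing it. A minimal sketch,
assuming Python's standard unicodedata module and its NFC (composed) and NFD
(decomposed) normalization forms:

```python
import unicodedata

precomposed = "\u00e2"   # LATIN SMALL LETTER A WITH CIRCUMFLEX
decomposed = "a\u0302"   # LATIN SMALL LETTER A + COMBINING CIRCUMFLEX ACCENT

# Plain string comparison looks at code elements, so the two forms differ.
plain_equal = precomposed == decomposed
print(plain_equal)

# NFC composes the two-element sequence into the single code element.
composed = unicodedata.normalize("NFC", decomposed)
print(composed == precomposed)

# NFD decomposes the single code element into the two-element sequence.
split = unicodedata.normalize("NFD", precomposed)
print(split == decomposed)
```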
Examples of code names:
Code Point  Code Name                             Code Element
U+005A      LATIN CAPITAL LETTER Z                Capital letter "Z" in Latin
U+0061      LATIN SMALL LETTER A                  Small letter "a" in Latin
U+0302      COMBINING CIRCUMFLEX ACCENT           Circumflex accent on top of other letters
U+00E2      LATIN SMALL LETTER A WITH CIRCUMFLEX  Small letter "a" with "^" on top
Compared with other character coding standards, Unicode has the following unique
features:
- Full 16-bit coding. Each basic code is a 16-bit number, and none of the
16 bits is reserved for any special purpose.
- Big enough to handle all existing written languages and symbols. 16 bits give
65,536 code values. The code space can be extended with paired 16-bit codes
(surrogate pairs) to cover over a million additional code values, up to U+10FFFF.
- Characters in the same language are coded in groups and ordered according to
their natural sequence whenever possible.
- No escape sequences. No shift states.
- Common characters (letters) shared across languages are unified into a single code
element. The biggest example is the unification of Chinese/Japanese/Korean (CJK)
ideographs into one common set of code elements.
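The paired 16-bit codes mentioned in the list above are known as surrogate
pairs in UTF-16. The sketch below shows the arithmetic for a code point above
U+FFFF. The function name to_surrogate_pair is hypothetical, and U+20000 (a
CJK ideograph outside the first 65,536 code values) is chosen only as an
example.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into two 16-bit codes (UTF-16).

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point must be in the range U+10000..U+10FFFF")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # lower 10 bits
    return high, low


pair = to_surrogate_pair(0x20000)
print([f"{unit:04X}" for unit in pair])

# Each half of a pair carries 10 bits, so pairs add 1024 * 1024 code values.
extra_code_values = 0x400 * 0x400
print(extra_code_values)

# Cross-check against Python's built-in UTF-16 encoder (big-endian, no BOM).
encoded = "\U00020000".encode("utf-16-be")
```

The cross-check at the end compares the hand-computed pair with the bytes
produced by the standard encoder, which is a convenient way to validate this
kind of arithmetic.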