JIS X0208 Character Set and Encodings
This tutorial helps you understand:
- JIS X0208 Character Set
- EUC-JP Encoding
- ISO-2022-JP Encoding
- Shift-JIS Encoding
JIS X0208 Character Set
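JIS X0208: A coded character set for Japanese, first established in 1978 (as JIS C 6226)
and revised in 1983, 1990, and 1997. The layout below follows the 1990 revision.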
JIS stands for Japanese Industrial Standards.
JIS X0208 arranges characters into a matrix of 94 rows and 94 columns.
Each position is identified by a row number (ku) and a cell number (ten), together
called a kuten code. The rows are organized as follows:
Rows     # of Chars   Characters
01-02                 Punctuation, symbols
03                    ISO 646 (alphanumerics only)
04                    Hiragana
05                    Katakana
06                    Greek
07                    Cyrillic
08                    Line drawing
16-47    2965         Kanji level 1, ordered by on-yomi
48-83    3384         Kanji level 2, ordered by Kangxi radical, then stroke
84       6            Miscellaneous kanji
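To make the row layout concrete, here is a minimal Python sketch that looks up which block a row number belongs to. The names ROW_BLOCKS and describe_row are made up for this tutorial, and rows that the table does not list are simply reported as "Unassigned", which is an assumption of this sketch.

```python
# Row ranges copied from the table above (first row, last row, description).
ROW_BLOCKS = [
    (1, 2, "Punctuation, symbols"),
    (3, 3, "ISO 646 (alphanumerics only)"),
    (4, 4, "Hiragana"),
    (5, 5, "Katakana"),
    (6, 6, "Greek"),
    (7, 7, "Cyrillic"),
    (8, 8, "Line drawing"),
    (16, 47, "Kanji level 1"),
    (48, 83, "Kanji level 2"),
    (84, 84, "Miscellaneous kanji"),
]


def describe_row(row: int) -> str:
    """Return the block name for a JIS X0208 row number (1..94)."""
    if not 1 <= row <= 94:
        raise ValueError("row must be in 1..94")
    for first, last, name in ROW_BLOCKS:
        if first <= row <= last:
            return name
    # Rows not listed in the table (assumption: treat them as unassigned).
    return "Unassigned"


print(describe_row(4))   # Hiragana
print(describe_row(20))  # Kanji level 1
```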
Modern Japanese is written with four scripts: katakana, hiragana, kanji, and romaji.
Katakana contains 46 basic characters, with very angular strokes.
Each Katakana character represents a unique sound. Five of them are vowels.
Katakana characters can be used to express any sound in the Japanese language.
Like Pinyin in Chinese, Katakana spells words by sound. It is commonly used to write
foreign names.
Hiragana contains 46 basic characters, with smoother, more rounded strokes.
Each Hiragana character represents a unique sound. Hiragana characters can be used
to express any sound in the Japanese language. Hiragana is closely related to Katakana.
In fact, each Hiragana character has a counterpart in Katakana.
Hiragana is commonly used to express simple words. It is also the first writing system
taught to Japanese children.
Kanji consists of thousands of Chinese characters that were brought to Japan
many centuries ago.
Romaji is the Roman alphabet. It is used to write foreign words and to spell
Japanese words in Latin letters.
EUC-JP Encoding
EUC-JP: An encoding for the JIS X0208 character set. It is an 8-bit encoding with 1 to
2 bytes per character:
Number Of    Valid Range
Bytes        Byte 1         Byte 2
1            0x21 - 0x7E
2            0xA1 - 0xFE    0xA1 - 0xFE
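The 2-byte range 0xA1 - 0xFE corresponds to the 94 rows and 94 columns of the matrix: each byte is the row or column number plus 0xA0. The sketch below shows this in Python. The function name kuten_to_euc_jp is made up for this tutorial, and it only covers the 2-byte JIS X0208 case described here.

```python
def kuten_to_euc_jp(row: int, cell: int) -> bytes:
    """Convert a JIS X0208 row/column position (1..94 each) to 2 EUC-JP bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # Adding 0xA0 maps 1..94 onto 0xA1..0xFE, so both bytes have the high bit set.
    return bytes([0xA0 + row, 0xA0 + cell])


# Hiragana "a" sits at row 4, column 2.
encoded = kuten_to_euc_jp(4, 2)
print(encoded.hex())             # a4a2
print(encoded.decode("euc_jp"))  # checked against Python's built-in codec
```

Because every byte of a 2-byte character is 0xA1 or higher, it can never be confused with a 1-byte ASCII character.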
ISO-2022-JP Encoding
ISO-2022-JP: An encoding for the JIS X0208 character set. It is a 7-bit encoding with 1 to
2 bytes per character:
Number Of    Valid Range
Bytes        Byte 1         Byte 2
1            0x21 - 0x7E
2            0x21 - 0x7E    0x21 - 0x7E
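Escape sequences are used to switch between ASCII mode and JIS mode:
ASCII_mode <Esc>$B JIS_mode <Esc>(B ASCII_mode
The sequence <Esc>(J may also appear. It selects JIS X 0201 Roman, which differs from
ASCII in only two characters (yen sign for backslash, overline for tilde).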
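To see how the escape sequences divide a byte stream, here is a small Python sketch that splits ISO-2022-JP data into segments by mode. The names ESCAPES and split_modes and the mode labels are made up for this tutorial. The sketch assumes well-formed input and only knows the four 3-byte escape sequences listed in the code.

```python
# Escape sequences recognized by this sketch (3 bytes each).
ESCAPES = {
    b"\x1b$B": "JIS X0208",
    b"\x1b$@": "JIS X0208",  # older designation of the same character set
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS-Roman",
}


def split_modes(data: bytes):
    """Split ISO-2022-JP bytes into (mode, payload) segments."""
    segments = []
    mode = "ASCII"  # a stream always starts in ASCII mode
    start = 0
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in ESCAPES:
            if i > start:
                segments.append((mode, data[start:i]))
            mode = ESCAPES[seq]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        segments.append((mode, data[start:]))
    return segments


print(split_modes(b'AB\x1b$B$"\x1b(BC'))
# [('ASCII', b'AB'), ('JIS X0208', b'$"'), ('ASCII', b'C')]
```

Scanning byte by byte is safe here because the escape byte 0x1B never occurs inside a character: all character bytes are in the range 0x21 - 0x7E.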
The advantage of a 7-bit encoding is compatibility with old email systems that
cannot carry 8-bit data.
When a JIS character is written in JIS mode, its code point (a 2-byte integer) is converted
into 2 bytes by setting the leading bit of each original byte to 0. If you need an algorithm,
try this:
Input:
unsigned integer c - the code point of the character to be encoded
Output:
byte b1, b2 - the encoded sequence of bytes
Algorithm:
if (c<0x80)
  b1 = (c>>0) & 0x7F
  b2 = null
else if (c<0x10000)
  b1 = (c>>8) & 0x7F
  b2 = (c>>0) & 0x7F
end if
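The same idea can be tried out in Python. This is only a sketch: the function names to_7bit and encode_iso2022jp are made up for this tutorial, 2-byte code points are assumed to be given in their 8-bit form (for example 0xA4A2), and only the <Esc>$B and <Esc>(B escape sequences are produced.

```python
def to_7bit(c: int) -> bytes:
    """Clear the leading bit of each byte of code point c."""
    if c < 0x80:
        return bytes([c & 0x7F])
    if c < 0x10000:
        return bytes([(c >> 8) & 0x7F, c & 0x7F])
    raise ValueError("code point does not fit in 2 bytes")


def encode_iso2022jp(code_points) -> bytes:
    """Encode code points, adding an escape sequence at every mode change."""
    out = bytearray()
    in_jis = False  # the stream starts in ASCII mode
    for c in code_points:
        want_jis = c >= 0x80  # anything beyond 1-byte ASCII needs JIS mode
        if want_jis != in_jis:
            out += b"\x1b$B" if want_jis else b"\x1b(B"
            in_jis = want_jis
        out += to_7bit(c)
    if in_jis:
        out += b"\x1b(B"  # always finish in ASCII mode
    return bytes(out)


# "A", Hiragana "a" (0xA4A2 in 8-bit form), "B"
data = encode_iso2022jp([0x41, 0xA4A2, 0x42])
print(data)                        # b'A\x1b$B$"\x1b(BB'
print(data.decode("iso2022_jp"))   # checked against Python's built-in codec
```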
Shift-JIS Encoding
Shift-JIS: An encoding for the JIS X0208 character set. It is an 8-bit encoding with 1 to
2 bytes per character:
Number Of    Valid Range
Bytes        Byte 1         Byte 2
1            0x21 - 0x7E    (for ASCII)
1            0xA1 - 0xDF    (for half-width Katakana)
2            0x81 - 0x9F    0x40 - 0x7E, 0x80 - 0xFC
2            0xE0 - 0xEF    0x40 - 0x7E, 0x80 - 0xFC
Shift-JIS was developed by ASCII Corporation together with Microsoft, and Microsoft's
variant is known as code page 932. I have not found a simple encoding algorithm yet.
Hope someone can help with this.
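As a possible starting point, the conversion commonly described for Shift-JIS works on the two 7-bit JIS bytes of a character (each in 0x21 - 0x7E): two JIS rows are packed into one lead byte, and the second byte is shifted so that it avoids 0x7F. The Python sketch below follows that description. The function name jis_to_sjis is made up for this tutorial, and the sketch covers only plain JIS X0208 characters, not the extra characters of code page 932.

```python
def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Convert the two 7-bit JIS bytes of a character to Shift-JIS bytes."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("both bytes must be in 0x21..0x7E")
    # Two JIS rows share one lead byte, starting at 0x81.
    s1 = ((j1 - 0x21) >> 1) + 0x81
    if s1 > 0x9F:
        s1 += 0x40  # skip 0xA0..0xDF, used by 1-byte Katakana
    if j1 & 1:
        # Odd first byte: second byte lands in 0x40..0x9E, skipping 0x7F.
        s2 = j2 + 0x1F
        if s2 >= 0x7F:
            s2 += 1
    else:
        # Even first byte: second byte lands in 0x9F..0xFC.
        s2 = j2 + 0x7E
    return bytes([s1, s2])


# Hiragana "a" has JIS bytes 0x24 0x22.
encoded = jis_to_sjis(0x24, 0x22)
print(encoded.hex())                # 82a0
print(encoded.decode("shift_jis"))  # checked against Python's built-in codec
```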