Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

JIS X0208 Character Set and Encodings

This tutorial helps you understand:

  • JIS X0208 Character Set
  • EUC-JP Encoding
  • ISO-2022-JP Encoding
  • Shift-JIS Encoding

JIS X0208 Character Set

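JIS X0208: A coded character set for Japanese, first established in 1978 (as JIS C 6226) and revised in 1983, 1990, and 1997. The character counts in this tutorial follow the 1990 revision. JIS stands for Japanese Industrial Standards.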

JIS X0208 arranges characters into a matrix of 94 rows and 94 columns. A row is called a ku and a position within a row a ten, so a character's location is given by its kuten (the Japanese counterpart of the Chinese term quwei). The rows are organized as follows:

Rows     # of 
(Ku)     Chars   Characters
01-02     147    Punctuation, symbols
03         62    ISO 646 (alphanumerics only)
04         83    Hiragana
05         86    Katakana
06         48    Greek
07         66    Cyrillic
08         32    Line drawing
16-47    2965    Kanji level 1, ordered by on-yomi
48-83    3384    Kanji level 2, ordered by Kangxi radical, then stroke
84          6    Miscellaneous kanji
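The matrix position of a character maps directly to its 2-byte JIS code: each of the two coordinates is offset by 0x20, so both bytes fall in the range 0x21 - 0x7E. The following is a minimal sketch of that mapping; the function name row_cell_to_jis is a hypothetical helper used only for illustration.

```python
def row_cell_to_jis(row: int, cell: int) -> int:
    """Return the 2-byte JIS X0208 code for a matrix position.

    row and cell are the 1-based coordinates (1..94) in the 94 x 94 matrix.
    The name of this helper is illustrative, not part of any standard API.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    # Adding 0x20 moves each coordinate into the printable range 0x21..0x7E.
    return ((row + 0x20) << 8) | (cell + 0x20)


print(hex(row_cell_to_jis(4, 2)))   # hiragana "a" at row 04, cell 02: 0x2422
print(hex(row_cell_to_jis(16, 1)))  # first level 1 kanji at row 16, cell 01: 0x3021
```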

Four scripts are used in writing modern Japanese: katakana, hiragana, kanji, and romaji.

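Katakana contains 46 basic characters, with very angular strokes. Each katakana character represents a unique sound, and five of them are the vowels. Katakana characters can be used to express any sound in the Japanese language. JIS X0208 encodes more than 46 katakana because it also includes the small and voiced variants.

Katakana is loosely comparable to Pinyin in Chinese in that it spells out sounds. It is commonly used to write foreign names and loanwords.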

Hiragana contains 46 basic characters, with smoother, more rounded strokes. Each hiragana character represents a unique sound. Hiragana characters can be used to express any sound in the Japanese language. Hiragana is closely related to katakana. In fact, each hiragana character has a counterpart in katakana.
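In JIS X0208 this pairing is visible in the layout: a hiragana character in row 04 and its katakana counterpart in row 05 share the same column. The sketch below illustrates this using Python's built-in euc_jp codec, where row 04 has the lead byte 0xA4 and row 05 the lead byte 0xA5. The function name is hypothetical, and the approach is only an illustration of the layout, not a general-purpose converter.

```python
def hiragana_to_katakana(ch: str) -> str:
    """Map one JIS X0208 hiragana character to its katakana counterpart.

    Illustrative helper: it relies on hiragana (row 04) and katakana (row 05)
    occupying the same column, so only the lead byte changes in EUC-JP.
    """
    euc = ch.encode("euc_jp")
    if len(euc) != 2 or euc[0] != 0xA4:
        raise ValueError("expected a single JIS X0208 hiragana character")
    # Keep the column byte, switch the row byte from 0xA4 (row 04) to 0xA5 (row 05).
    return bytes([0xA5, euc[1]]).decode("euc_jp")


# U+3042 is hiragana "a"; its counterpart U+30A2 is katakana "a".
print(hiragana_to_katakana("\u3042") == "\u30a2")  # True
```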

Hiragana is commonly used to write simple native words and grammatical endings. It is also the first writing system taught to Japanese children.

Kanji contains thousands of Chinese characters that were brought to Japan many centuries ago.

Romaji is the Roman (Latin) alphabet. It is used to write foreign words and abbreviations.

EUC-JP Encoding

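EUC-JP: An encoding for the JIS X0208 character set. It is an 8-bit encoding that uses 1 byte for each ASCII character and 2 bytes for each JIS X0208 character. The 2-byte form is the JIS code with the leading bit of each byte set to 1. (EUC-JP also defines longer forms for half-width katakana and JIS X0212, which are not covered here.)

Number Of   Valid Range
Bytes       Byte 1        Byte 2       
   1        0x21 - 0x7E
   2        0xA1 - 0xFE   0xA1 - 0xFE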
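The 2-byte range 0xA1 - 0xFE is simply the 7-bit JIS range 0x21 - 0x7E with the leading bit of each byte turned on. Here is a minimal sketch of that step, checked against Python's built-in euc_jp codec; the function name jis_to_euc_jp is hypothetical.

```python
def jis_to_euc_jp(jis_code: int) -> bytes:
    """Convert a 2-byte JIS X0208 code (0x2121..0x7E7E) to EUC-JP bytes.

    Illustrative helper: it covers only the 2-byte JIS X0208 part of EUC-JP.
    """
    hi, lo = jis_code >> 8, jis_code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("both bytes must be in 0x21..0x7E")
    # Setting the leading bit moves each byte into 0xA1..0xFE.
    return bytes([hi | 0x80, lo | 0x80])


# JIS code 0x2422 is hiragana "a" (U+3042).
print(jis_to_euc_jp(0x2422).hex())                          # a4a2
print(jis_to_euc_jp(0x2422) == "\u3042".encode("euc_jp"))  # True
```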

ISO-2022-JP Encoding

ISO-2022-JP: An encoding for the JIS X0208 character set. It is a 7-bit encoding with 1 to 2 bytes per character:

Number Of   Valid Range
Bytes       Byte 1        Byte 2       
   1        0x21 - 0x7E
   2        0x21 - 0x7E   0x21 - 0x7E

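Escape sequences are used to switch between ASCII mode and JIS mode. <Esc>$B selects JIS X0208 and <Esc>(B returns to ASCII. (<Esc>(J selects JIS X0201 Roman, which differs from ASCII in only two characters.)

ASCII_mode <Esc>$B JIS_mode <Esc>(B ASCII_mode

The advantage of a 7-bit encoding is compatibility with old email systems that do not pass 8-bit data reliably.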
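The mode switching can be observed with Python's built-in iso2022_jp codec. The short example below encodes a string that mixes ASCII with one hiragana character and shows where the escape sequences land; note that this codec returns to ASCII with <Esc>(B. The variable names are arbitrary.

```python
# "A", hiragana "a" (U+3042), "B": ASCII, then JIS X0208, then ASCII again.
text = "A\u3042B"
encoded = text.encode("iso2022_jp")

# ESC $ B enters JIS mode, 0x24 0x22 is the character, ESC ( B returns to ASCII.
print(encoded == b"A\x1b$B\x24\x22\x1b(BB")  # True

# Every byte of the result fits in 7 bits.
print(all(b < 0x80 for b in encoded))        # True

# Decoding restores the original text.
print(encoded.decode("iso2022_jp") == text)  # True
```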

When a JIS character is written in JIS mode, its 2-byte EUC-JP code is converted into 2 bytes by turning the leading bit of each original byte to 0. If you need an algorithm, try this:

Input: 
   unsigned integer c - the EUC-JP code of the character to be encoded
                        (1 byte for ASCII, 2 bytes for JIS X0208)
Output: 
   byte b1, b2 - the encoded sequence of bytes
Algorithm:
   if (c<0x80) 
      // ASCII character: a single byte, already 7-bit
      b1 = c & 0x7F
      b2 = null
   else
      // JIS X0208 character: clear the leading bit of each byte
      b1 = (c>>8) & 0x7F
      b2 = c & 0x7F
   end if
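The bit-clearing step can also be sketched in Python. This assumes the input is the character's EUC-JP value (1 byte for ASCII, 2 bytes for JIS X0208) and produces only the character bytes, without the surrounding escape sequences; the function name to_seven_bit is hypothetical.

```python
def to_seven_bit(c: int) -> bytes:
    """Clear the leading bit of each byte of an EUC-JP code.

    Illustrative helper: escape sequences for mode switching are not emitted.
    """
    if c < 0x80:
        # ASCII: a single byte that is already 7-bit.
        return bytes([c & 0x7F])
    # JIS X0208: two bytes, each with the leading bit cleared.
    return bytes([(c >> 8) & 0x7F, c & 0x7F])


# EUC-JP 0xA4A2 is hiragana "a"; in JIS mode it becomes 0x24 0x22.
print(to_seven_bit(0xA4A2).hex())  # 2422
print(to_seven_bit(0x41))          # b'A'
```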

Shift-JIS Encoding

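Shift-JIS: An encoding for the JIS X0208 character set. It is an 8-bit encoding with 1 to 2 bytes per character. In the 2-byte form, either range of the first byte can be followed by either range of the second byte:

Number Of   Valid Range
Bytes       Byte 1        Byte 2       
   1        0x21 - 0x7E   (for ASCII)
   1        0xA1 - 0xDF   (for half-width Katakana)
   2        0x81 - 0x9F   0x40 - 0x7E or 0x80 - 0xFC
   2        0xE0 - 0xEF   0x40 - 0x7E or 0x80 - 0xFC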

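Microsoft's variant of Shift-JIS is known as code page 932. Unlike EUC-JP and ISO-2022-JP, Shift-JIS has no simple bit-masking rule: the two bytes have to be computed arithmetically from the row and column numbers of the character.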
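One commonly described way to compute the Shift-JIS bytes works from the character's row and column in the 94 x 94 matrix: two rows share one lead byte, and the trail byte range depends on whether the row is odd or even. The sketch below follows that approach and covers only the 2-byte JIS X0208 part; the function name row_cell_to_shift_jis is hypothetical, and the results should be checked against a real codec before relying on them.

```python
def row_cell_to_shift_jis(row: int, cell: int) -> bytes:
    """Compute Shift-JIS bytes from a JIS X0208 matrix position (1..94 each).

    Illustrative helper covering only the 2-byte JIS X0208 part of Shift-JIS.
    """
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must both be in 1..94")
    # Two rows share one lead byte: rows 1..62 use 0x81..0x9F,
    # rows 63..94 use 0xE0..0xEF.
    lead = (row - 1) // 2 + (0x81 if row <= 62 else 0xC1)
    if row % 2 == 1:
        # Odd rows use trail bytes 0x40..0x9E, skipping 0x7F.
        trail = cell + 0x3F
        if trail >= 0x7F:
            trail += 1
    else:
        # Even rows use trail bytes 0x9F..0xFC.
        trail = cell + 0x9E
    return bytes([lead, trail])


print(row_cell_to_shift_jis(4, 2).hex())  # hiragana "a": 82a0
print(row_cell_to_shift_jis(5, 2).hex())  # katakana "a": 8341
```

Python's built-in shift_jis codec can serve as the reference for such a check, for example by comparing the result for row 04, column 02 with "\u3042".encode("shift_jis").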

Dr. Herong Yang, updated in 2007