Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

GB18030 Character Set and Encoding

History of GB Character Set

GB: An abbreviation of Guojia Biaozhun, or Guo Biao, meaning "national standard" in Chinese.

GB2312-1980: A coded character set and encoding scheme established by the government of the People's Republic of China (PRC) in 1980. GB2312-1980 contains 7445 characters, including 6763 Hanzi and 682 non-Hanzi characters.

GB13000.1: A coded character set and encoding scheme established by PRC in 1993 for Hanzi characters. GB13000.1 is designed to be compatible with Unicode 2.1 by keeping all characters in GB2312-1980 untouched and positioning all additional characters defined in the Unified Han portion of Unicode 2.1 around the GB2312-1980 character set. GB13000.1 is also called Guojia Biaozhun Kuozhan (GBK). It defines 23940 code points containing 21886 characters.

GB18030-2000: A coded character set and encoding scheme established by PRC as an update of GB13000.1 to be compatible with Unicode 3.0. GB18030-2000 has about 1.6 million valid code points, about 0.5 million more than Unicode 3.0.

Since September 1, 2001, the government of PRC has required that all operating systems on non-handheld computers sold in PRC comply with the GB18030-2000 standard.

GB18030-2000 Encoding

The GB18030-2000 encoding scheme uses one, two, or four bytes to encode a character. The following table shows the ranges of valid byte sequences:

Number Of   Valid Range
Bytes       Byte 1        Byte 2        Byte 3        Byte 4
   1        0x00 - 0x7F
   2        0x81 - 0xFE   0x40 - 0x7E
   2        0x81 - 0xFE   0x80 - 0xFE
   4        0x81 - 0xFE   0x30 - 0x39   0x81 - 0xFE   0x30 - 0x39
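As a rough check on the table, the number of valid byte sequences in each row can be counted directly from the byte ranges. The sketch below is illustrative only; the names RANGES and count_sequences are assumptions made for this example and do not come from the standard.

```python
from math import prod

# Byte ranges copied from the table above: one (low, high) pair per byte.
RANGES = [
    [(0x00, 0x7F)],
    [(0x81, 0xFE), (0x40, 0x7E)],
    [(0x81, 0xFE), (0x80, 0xFE)],
    [(0x81, 0xFE), (0x30, 0x39), (0x81, 0xFE), (0x30, 0x39)],
]


def count_sequences(row):
    """Return the number of distinct byte sequences allowed by one table row."""
    return prod(high - low + 1 for low, high in row)


counts = [count_sequences(row) for row in RANGES]
one_byte = counts[0]
two_byte = counts[1] + counts[2]   # both two-byte rows together
four_byte = counts[3]
total = one_byte + two_byte + four_byte

print(one_byte, two_byte, four_byte, total)
# prints: 128 23940 1587600 1611668
```

The two-byte total, 23940, matches the number of code points quoted for GBK in the history section, and the overall total of 1,611,668 is where the "1.6 million" figure comes from.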

Processing a GB18030 encoded byte stream from the beginning of the stream is easy, because the first one or two bytes of each sequence determine its length. Here is an algorithm to divide the stream into sequences of bytes that represent valid GB18030 characters:

Input: 
   byte stream in
Algorithm: 
   while (in.hasNext())
      b1 = in.nextByte()
      if (0x00 <= b1 <= 0x7F)
         b1 is a valid byte sequence
      else if (0x81 <= b1 <= 0xFE)
         b2 = in.nextByte()
         if (0x40 <= b2 <= 0x7E || 0x80 <= b2 <= 0xFE) 
            b1, b2 is a valid byte sequence
         else if (0x30 <= b2 <= 0x39)
            b3 = in.nextByte()
            b4 = in.nextByte()
            if (0x81 <= b3 <= 0xFE && 0x30 <= b4 <= 0x39)
               b1, b2, b3, b4 is a valid byte sequence
            else
               stream is corrupted
            end if
         else
            stream is corrupted
         end if
      else
         stream is corrupted
      end if
   end while

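The splitting algorithm can be sketched in runnable form as follows. This is a minimal illustration, not a full GB18030 decoder: the function name split_gb18030 is an assumption made for this example, and the code only checks byte ranges against the table in this section. It also checks the third and fourth bytes of a four-byte sequence and treats a stream that ends in the middle of a sequence as corrupted.

```python
from typing import List


def split_gb18030(data: bytes) -> List[bytes]:
    """Split data into 1-, 2- and 4-byte GB18030 sequences.

    Raises ValueError if a byte is out of range or the data ends mid-sequence.
    Only the byte ranges are checked; no character mapping is performed.
    """
    sequences = []
    i = 0
    n = len(data)
    while i < n:
        b1 = data[i]
        if b1 <= 0x7F:
            length = 1
        elif 0x81 <= b1 <= 0xFE:
            if i + 1 >= n:
                raise ValueError(f"truncated sequence at offset {i}")
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
                length = 2
            elif 0x30 <= b2 <= 0x39:
                if i + 3 >= n:
                    raise ValueError(f"truncated sequence at offset {i}")
                b3, b4 = data[i + 2], data[i + 3]
                if not (0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
                    raise ValueError(f"invalid byte 3 or 4 at offset {i}")
                length = 4
            else:
                raise ValueError(f"invalid second byte at offset {i + 1}")
        else:
            raise ValueError(f"invalid first byte at offset {i}")
        sequences.append(data[i:i + length])
        i += length
    return sequences


# Usage: encode a short string with Python's built-in "gb18030" codec,
# then split the result and show the length of each sequence.
sample = "A\u4e2d\U00010000".encode("gb18030")
print([len(seq) for seq in split_gb18030(sample)])
# prints: [1, 2, 4]
```

The usage lines rely on Python's standard gb18030 codec only to produce sample input; the splitter itself works on any bytes object.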
Dr. Herong Yang, updated in 2007