GB18030 Character Set and Encoding
History of GB Character Set
GB: An abbreviation of Guojia Biaozhun, or Guo Biao, meaning
"national standard" in Chinese.
GB2312-1980: A coded character set and encoding scheme established
by the government of the People's Republic of China (PRC) in 1980. GB2312-1980
contains 7445 characters: 6763 Hanzi and 682 non-Hanzi characters.
GB13000.1: A coded character set and encoding scheme established
by the PRC in 1993 for Hanzi characters. GB13000.1 is designed to be compatible
with Unicode 2.1 by keeping all characters in GB2312-1980 untouched and
positioning all additional characters defined in the Unified Han portion of
Unicode 2.1 around the GB2312-1980 character set. GB13000.1 is also called
Guojia Biaozhun Kuozhan (GBK). It defines 23940 code points containing 21886
characters.
GB18030-2000: A coded character set and encoding scheme established
by the PRC as an update of GB13000.1 to be compatible with Unicode 3.0.
GB18030-2000 has about 1.6 million valid code points, about 0.5 million more
than Unicode 3.0.
The government of the PRC has required since September 1, 2001 that all
operating systems on non-handheld computers sold in the PRC comply with
the GB18030-2000 standard.
GB18030-2000 Encoding
The GB18030-2000 encoding scheme uses one, two, or four bytes to encode
a character. The following table shows the ranges of valid byte
sequences:
Number of   Valid Range
Bytes       Byte 1        Byte 2        Byte 3        Byte 4
1           0x00 - 0x7F
2           0x81 - 0xFE   0x40 - 0x7E
2           0x81 - 0xFE   0x80 - 0xFE
4           0x81 - 0xFE   0x30 - 0x39   0x81 - 0xFE   0x30 - 0x39
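The code point figures quoted earlier can be checked directly from the table. The following Python sketch counts the byte sequences allowed by each row; the helper name span is only an illustration and is not part of any standard.

```python
def span(lo, hi):
    """Number of byte values in the inclusive range lo..hi."""
    return hi - lo + 1

# Row 1: single-byte sequences.
one_byte = span(0x00, 0x7F)

# Rows 2 and 3: same lead byte range, two alternative trail byte ranges.
two_byte = span(0x81, 0xFE) * (span(0x40, 0x7E) + span(0x80, 0xFE))

# Row 4: four-byte sequences.
four_byte = (span(0x81, 0xFE) * span(0x30, 0x39)
             * span(0x81, 0xFE) * span(0x30, 0x39))

total = one_byte + two_byte + four_byte
print(one_byte, two_byte, four_byte, total)
```

The two-byte count comes to 23940, which matches the number of code points given for GBK, and the total is 1,611,668, which is the "1.6 million" figure for GB18030-2000.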
Processing a GB18030 encoded byte stream from the beginning of the stream
is easy. Here is an algorithm to divide the stream into sequences of bytes
that represent valid GB18030 characters:
Input:
   byte stream in
Algorithm:
   while (in.hasNext())
      b1 = in.nextByte()
      if (0x00 <= b1 <= 0x7F)
         b1 is a valid byte sequence
      else if (0x81 <= b1 <= 0xFE)
         b2 = in.nextByte()
         if (0x40 <= b2 <= 0x7E || 0x80 <= b2 <= 0xFE)
            b1, b2 is a valid byte sequence
         else if (0x30 <= b2 <= 0x39)
            b3 = in.nextByte()
            b4 = in.nextByte()
            if (0x81 <= b3 <= 0xFE && 0x30 <= b4 <= 0x39)
               b1, b2, b3, b4 is a valid byte sequence
            else
               stream is corrupted
            end if
         else
            stream is corrupted
         end if
      else
         stream is corrupted
      end if
   end while
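As an illustration, the algorithm above could be written in Python roughly as follows. This is a minimal sketch, not a reference implementation: the function name split_gb18030 is hypothetical, it works on an in-memory bytes object instead of a stream, and it checks only the byte ranges from the table, not whether each sequence is actually assigned to a character. It also checks bytes 3 and 4 against the table and treats a truncated sequence as corruption.

```python
from typing import List

def split_gb18030(data: bytes) -> List[bytes]:
    """Split data into byte sequences that fit the GB18030 byte ranges."""
    sequences = []
    i, n = 0, len(data)
    while i < n:
        b1 = data[i]
        if b1 <= 0x7F:
            length = 1                      # single-byte sequence
        elif 0x81 <= b1 <= 0xFE:
            if i + 1 >= n:
                raise ValueError(f"truncated sequence at offset {i}")
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
                length = 2                  # two-byte sequence
            elif 0x30 <= b2 <= 0x39:
                if i + 3 >= n:
                    raise ValueError(f"truncated sequence at offset {i}")
                b3, b4 = data[i + 2], data[i + 3]
                if not (0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
                    raise ValueError(f"corrupted sequence at offset {i}")
                length = 4                  # four-byte sequence
            else:
                raise ValueError(f"corrupted sequence at offset {i}")
        else:
            raise ValueError(f"corrupted sequence at offset {i}")
        sequences.append(data[i:i + length])
        i += length
    return sequences

# Usage: encode "A" and U+4E2D with Python's built-in gb18030 codec,
# then split the result back into per-character byte sequences.
sample = "A\u4e2d".encode("gb18030")
print(split_gb18030(sample))
```

Raising ValueError on the first bad byte mirrors the "stream is corrupted" branches of the pseudocode; a caller that wants to recover could catch the exception and resynchronize instead.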