This section provides a quick introduction to the GB18030 encoding for the GB18030 character set. GB18030 is a multi-byte (1-byte, 2-byte, or 4-byte) encoding.
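The GB18030 encoding scheme uses one, two, or four bytes to encode
a character. The following table shows the ranges of valid byte
sequences:

   Bytes   Byte 1      Byte 2                  Byte 3      Byte 4
   1       0x00-0x7F
   2       0x81-0xFE   0x40-0x7E, 0x80-0xFE
   4       0x81-0xFE   0x30-0x39               0x81-0xFE   0x30-0x39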
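
As a quick check of the one-, two-, and four-byte cases, the following Python sketch encodes a few characters with the standard `gb18030` codec and reports how many bytes each one takes. The sample characters and the helper name `gb18030_length` are chosen here for illustration only.

```python
def gb18030_length(ch):
    """Return the number of bytes Python's gb18030 codec uses for one character."""
    return len(ch.encode("gb18030"))


# Illustrative samples: an ASCII letter, a common Chinese character,
# a C1 control character, and the first supplementary-plane code point.
samples = ["A", "\u4e2d", "\u0080", "\U00010000"]

for ch in samples:
    encoded = ch.encode("gb18030")
    print(f"U+{ord(ch):04X} -> {encoded.hex()} ({gb18030_length(ch)} bytes)")
```

Every length printed should be 1, 2, or 4, since GB18030 has no three-byte form.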
Processing a GB18030 encoded byte stream from the beginning of the stream
is straightforward, because the first one or two bytes of a sequence determine
its length. Here is an algorithm to divide the stream into sequences of bytes
that represent valid GB18030 characters:
Input:
   byte stream in

Algorithm:
   while (in.hasNext())
      b1 = in.nextByte()
      if (0x00 <= b1 <= 0x7F)
         b1 is a valid byte sequence
      else if (0x81 <= b1 <= 0xFE)
         b2 = in.nextByte()
         if (0x40 <= b2 <= 0x7E || 0x80 <= b2 <= 0xFE)
            b1, b2 is a valid byte sequence
         else if (0x30 <= b2 <= 0x39)
            b3 = in.nextByte()
            b4 = in.nextByte()
            if (0x81 <= b3 <= 0xFE && 0x30 <= b4 <= 0x39)
               b1, b2, b3, b4 is a valid byte sequence
            else
               stream is corrupted
            end if
         else
            stream is corrupted
         end if
      else
         stream is corrupted
      end if
   end while
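
The algorithm above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the function name `split_gb18030` is hypothetical, the input is assumed to be a complete `bytes` object rather than a stream, and the sketch makes two assumptions of its own, namely that the third and fourth bytes of a four-byte sequence must fall in 0x81-0xFE and 0x30-0x39, and that input ending in the middle of a character counts as corrupted.

```python
def split_gb18030(data):
    """Split bytes into the byte sequences of individual GB18030 characters.

    Raises ValueError if the data is corrupted or ends mid-character.
    Only byte ranges are checked; unassigned code points are not rejected.
    """
    sequences = []
    i = 0
    n = len(data)
    while i < n:
        b1 = data[i]
        if b1 <= 0x7F:
            # One-byte sequence (ASCII range).
            length = 1
        elif 0x81 <= b1 <= 0xFE:
            if i + 1 >= n:
                raise ValueError(f"truncated sequence at offset {i}")
            b2 = data[i + 1]
            if 0x40 <= b2 <= 0x7E or 0x80 <= b2 <= 0xFE:
                # Two-byte sequence.
                length = 2
            elif 0x30 <= b2 <= 0x39:
                # Four-byte sequence: validate the trailing two bytes as well.
                if i + 3 >= n:
                    raise ValueError(f"truncated sequence at offset {i}")
                b3, b4 = data[i + 2], data[i + 3]
                if not (0x81 <= b3 <= 0xFE and 0x30 <= b4 <= 0x39):
                    raise ValueError(f"invalid four-byte sequence at offset {i}")
                length = 4
            else:
                raise ValueError(f"invalid second byte at offset {i + 1}")
        else:
            # 0x80 and 0xFF cannot start a sequence.
            raise ValueError(f"invalid first byte at offset {i}")
        sequences.append(data[i:i + length])
        i += length
    return sequences


# Usage: split an encoded string and decode each piece on its own.
encoded = "A\u4e2d\U00010000".encode("gb18030")
for seq in split_gb18030(encoded):
    print(seq.hex(), repr(seq.decode("gb18030")))
```

Because each piece is a complete character, it can be decoded independently of the others. This holds only when scanning starts at a character boundary, which is why the algorithm begins at the start of the stream.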