GB2312 Encoding for GB2312 Character Set

Unicode Tutorials - Herong's Tutorial Examples


This section provides a quick introduction to the GB2312 encoding for the GB2312 character set. GB2312 is a 2-byte (8 bits per byte) encoding.

GB2312 encoding is the main encoding for the GB2312 character set. GB2312 encoding is based on native code values of GB2312 characters.

The native code value of each GB2312 character contains 2 bytes. The first byte is called the high byte, containing the row number plus 32; the second byte is called the low byte, containing the column number plus 32. For example, if a character is located at row 16 and column 1, its high byte will be 16 + 32 = 48 (0x30), and its low byte will be 1 + 32 = 33 (0x21). Putting them together, its native code value will be 0x3021.
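The arithmetic above can be sketched in a few lines of Python. The function name `native_code` is my own choice for illustration, and the range check assumes the usual 94 x 94 layout of the GB2312 table, which this section does not state.

```python
def native_code(row: int, col: int) -> int:
    """Return the 2-byte GB2312 native code for a (row, column) position."""
    # Assumption: rows and columns are numbered 1..94.
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    high = row + 32  # high byte = row number + 32
    low = col + 32   # low byte = column number + 32
    return (high << 8) | low


print(hex(native_code(16, 1)))  # 0x3021
```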

I guess that the reason for adding 32 to both the row number and the column number is to keep the byte values out of the low value range (0x00 to 0x1F), which is usually reserved for control codes in many computer systems.

However, the byte values of GB2312 native codes are not used directly as GB2312 encoding byte sequences, because they still collide with ASCII character codes: both bytes fall in the range 0x21 to 0x7E, which ASCII uses for printable characters. To resolve this problem, a value of 128 (0x80) is added to both bytes of the native code. For example, if a character is located at row 16 and column 1, its native code will be 0x3021, and its modified code will be 0x3021 + 0x8080 = 0xB0A1.

These modified codes are adopted as the GB2312 encoding, which can be safely mixed together with the ASCII encoding.
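As a quick check, here is a minimal sketch that builds the encoded bytes from a row and column and decodes them with Python's built-in "gb2312" codec. The helper name `euc_cn_bytes` is hypothetical, and the 1..94 range check is my assumption about the table layout.

```python
def euc_cn_bytes(row: int, col: int) -> bytes:
    """Return the GB2312 (EUC-CN) byte sequence for a (row, column) position."""
    # Assumption: rows and columns are numbered 1..94.
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    high = row + 32 + 128  # native high byte, then shifted out of the ASCII range
    low = col + 32 + 128   # native low byte, then shifted out of the ASCII range
    return bytes([high, low])


encoded = euc_cn_bytes(16, 1)
print(encoded.hex())             # b0a1
print(encoded.decode("gb2312"))  # the character at row 16, column 1 (U+554A)
```

Because both bytes are now 0xA1 or higher, they can never be mistaken for ASCII bytes, which all stay below 0x80.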

GB2312 encoding is also called EUC-CN (Extended Unix Code for China).

The GB2312 character set has another encoding called HZ, which maps each GB2312 character to two 7-bit bytes and uses the escape sequences ~{ and ~} to separate GB2312 characters from ASCII characters.
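To see the two encodings side by side, the sketch below encodes the same text with Python's built-in "gb2312" and "hz" codecs. The sample string is my own; it uses the character at row 16, column 1 (U+554A) between two ASCII letters.

```python
text = "a\u554ab"  # ASCII 'a', the GB2312 character at row 16 column 1, ASCII 'b'

euc = text.encode("gb2312")  # 8-bit GB2312 (EUC-CN) form
hz = text.encode("hz")       # 7-bit HZ form

# In HZ, the two bytes between ~{ and ~} are the native code bytes,
# that is, the EUC-CN bytes with 128 subtracted again.
native = bytes(b - 128 for b in "\u554a".encode("gb2312"))

print(euc)     # b'a\xb0\xa1b'
print(hz)      # b'a~{0!~}b'
print(native)  # b'0!'
```

Note that 0x30 and 0x21 are the ASCII codes of "0" and "!", which is why the native code 0x3021 shows up as "0!" inside the ~{...~} markers.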