GB2312 Tutorials - Herong's Tutorial Examples - v4.04, by Herong Yang
GB2312 Encodings
GB2312 Encoding transforms GB2312 Native Codes to the 0xA1A1-0xFEFE range, so that both bytes have the high bit set and 7-bit byte values stay reserved for ASCII Codes. HZ and ISO-2022-CN Encodings use escape sequences to switch between GB2312 Native Codes and ASCII Codes.
To resolve the incompatibility between GB2312 Native Codes and ASCII Codes, several encoding schemes have been developed over the years: GB2312 Encoding, HZ Encoding, and ISO-2022-CN Encoding.
Here are more detailed descriptions of these encodings:
1. What Is GB2312 Encoding? - GB2312 Encoding is an encoding that transforms GB2312 Native Codes (0x2121-0x7E7E) to the 0xA1A1-0xFEFE range to reserve 7-bit byte values for ASCII Codes. This is done by adding 0x80 to the high byte and the low byte of a GB2312 Native Code.
For example, the Chinese character 啊 has a GB2312 Native Code of 0x3021. Its GB2312 Encoding will be 0xB0A1, because 0x30 + 0x80 = 0xB0, and 0x21 + 0x80 = 0xA1.
GB2312 Encoding does resolve the incompatibility problem with ASCII Codes nicely. But the resulting byte sequence contains 8-bit byte values, which are not safe to transmit over channels that only guarantee 7-bit data, such as older email systems.
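The arithmetic above can be sketched in Java as follows. This is a minimal illustration only; the class and method names are hypothetical and do not come from this tutorial.

```java
// Sketch only: names are illustrative, not part of any library.
final class Gb2312EncodingSketch {

    // Convert a GB2312 Native Code (0x2121-0x7E7E) to its GB2312 Encoding
    // by adding 0x80 to the high byte and to the low byte.
    static int nativeToEncoding(int nativeCode) {
        int high = (nativeCode >> 8) & 0xFF;
        int low = nativeCode & 0xFF;
        if (high < 0x21 || high > 0x7E || low < 0x21 || low > 0x7E) {
            throw new IllegalArgumentException("Not a GB2312 Native Code");
        }
        return ((high + 0x80) << 8) | (low + 0x80);
    }

    // Reverse the transformation by clearing the high bit of both bytes.
    static int encodingToNative(int encoded) {
        return encoded & 0x7F7F;
    }

    public static void main(String[] args) {
        // Native Code 0x3021 from the example above
        System.out.printf("0x%04X%n", nativeToEncoding(0x3021)); // prints 0xB0A1
    }
}
```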
2. What Is HZ Encoding? - HZ Encoding is an encoding designed in 1989 by Fung Fung Lee that uses "~{" and "~}" to delimit groups of GB2312 Native Codes and distinguish them from ASCII Codes.
The advantage of HZ Encoding is that the resulting byte sequence only has 7-bit bytes, so it is still safe to be transmitted over computer networks. But the extra grouping sequences "~{" and "~}" may cause processing trouble.
For example, "2015~{Dj~} 1~{TB~} 1~{HU~}" is the HZ Encoding of "2015年 1月 1日".
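The grouping rule can be illustrated with a small Java sketch that converts text already in GB2312 Encoding into HZ Encoding. The class and method names are hypothetical. The sketch also writes a literal "~" as "~~", which follows the HZ specification (RFC 1843) rather than anything stated in this tutorial, and it ignores the line-continuation rule of that specification.

```java
// Sketch only: converts GB2312-encoded bytes to HZ text.
final class HzEncoderSketch {

    static String toHz(byte[] gb2312Bytes) {
        StringBuilder out = new StringBuilder();
        boolean inGroup = false; // true while inside a "~{" ... "~}" group
        int i = 0;
        while (i < gb2312Bytes.length) {
            int b1 = gb2312Bytes[i] & 0xFF;
            if (b1 >= 0xA1 && b1 <= 0xFE) {
                // Two-byte GB2312 Encoding: subtract 0x80 to get the Native Code bytes.
                if (i + 1 >= gb2312Bytes.length) {
                    throw new IllegalArgumentException("Truncated 2-byte code");
                }
                int b2 = gb2312Bytes[i + 1] & 0xFF;
                if (b2 < 0xA1 || b2 > 0xFE) {
                    throw new IllegalArgumentException("Invalid second byte");
                }
                if (!inGroup) {
                    out.append("~{");
                    inGroup = true;
                }
                out.append((char) (b1 - 0x80)).append((char) (b2 - 0x80));
                i += 2;
            } else if (b1 < 0x80) {
                // ASCII byte: close any open group first.
                if (inGroup) {
                    out.append("~}");
                    inGroup = false;
                }
                if (b1 == '~') {
                    out.append("~~"); // escaped tilde, per RFC 1843
                } else {
                    out.append((char) b1);
                }
                i++;
            } else {
                throw new IllegalArgumentException("Byte outside ASCII and GB2312 ranges");
            }
        }
        if (inGroup) {
            out.append("~}");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "2015", 0xC4EA, " 1", 0xD4C2, " 1", 0xC8D5 in GB2312 Encoding
        byte[] input = {'2', '0', '1', '5', (byte) 0xC4, (byte) 0xEA, ' ', '1',
                        (byte) 0xD4, (byte) 0xC2, ' ', '1', (byte) 0xC8, (byte) 0xD5};
        System.out.println(toHz(input)); // prints 2015~{Dj~} 1~{TB~} 1~{HU~}
    }
}
```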
3. What Is ISO-2022-CN Encoding? - ISO-2022-CN Encoding is an encoding based on the ISO-2022 standard, which allows multiple character sets to be included in a single character encoding system by using different escape sequences to switch between character sets.
When using ISO-2022-CN Encoding to mix GB2312 Native Codes with ASCII Codes, you need to use the "ESC $ ) A" escape sequence to designate the GB2312 character set, then the SO control byte (0x0E) to start GB2312 Native Codes and the SI control byte (0x0F) to return to ASCII Codes, as specified in RFC 1922.
Similar to HZ Encoding, ISO-2022-CN Encoding only has 7-bit bytes and is safe to be transmitted over computer networks. But it is more complex to process than HZ Encoding, because a decoder must track both the designated character set and the current shift state.
For example, "2015<ESC>$)A<SO>Dj<SI> 1<SO>TB<SI> 1<SO>HU<SI>" is the ISO-2022-CN Encoding of "2015年 1月 1日".
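As a rough illustration, the Java sketch below converts a single line of GB2312-encoded text into ISO-2022-CN. It assumes the form defined in RFC 1922, where "ESC $ ) A" designates GB2312 once and the SO (0x0E) and SI (0x0F) control bytes shift into and out of it. The class and method names are hypothetical, and multi-line input and the other character sets allowed by ISO-2022-CN are not handled.

```java
// Sketch only: single-line GB2312-encoded bytes to ISO-2022-CN text (RFC 1922 form assumed).
final class Iso2022CnSketch {
    static final char ESC = 0x1B, SO = 0x0E, SI = 0x0F;

    static String toIso2022Cn(byte[] gb2312Bytes) {
        StringBuilder out = new StringBuilder();
        boolean designated = false; // "ESC $ ) A" already written
        boolean shifted = false;    // currently in GB2312 mode (after SO)
        int i = 0;
        while (i < gb2312Bytes.length) {
            int b1 = gb2312Bytes[i] & 0xFF;
            if (b1 >= 0xA1 && b1 <= 0xFE) {
                if (i + 1 >= gb2312Bytes.length) {
                    throw new IllegalArgumentException("Truncated 2-byte code");
                }
                int b2 = gb2312Bytes[i + 1] & 0xFF;
                if (b2 < 0xA1 || b2 > 0xFE) {
                    throw new IllegalArgumentException("Invalid second byte");
                }
                if (!designated) {
                    out.append(ESC).append("$)A");
                    designated = true;
                }
                if (!shifted) {
                    out.append(SO);
                    shifted = true;
                }
                // Write the GB2312 Native Code bytes (high bit cleared).
                out.append((char) (b1 - 0x80)).append((char) (b2 - 0x80));
                i += 2;
            } else if (b1 < 0x80) {
                if (shifted) {
                    out.append(SI);
                    shifted = false;
                }
                out.append((char) b1);
                i++;
            } else {
                throw new IllegalArgumentException("Byte outside ASCII and GB2312 ranges");
            }
        }
        if (shifted) {
            out.append(SI); // end in ASCII mode
        }
        return out.toString();
    }

    // Make control bytes visible for printing.
    static String show(String s) {
        return s.replace(String.valueOf(ESC), "<ESC>")
                .replace(String.valueOf(SO), "<SO>")
                .replace(String.valueOf(SI), "<SI>");
    }

    public static void main(String[] args) {
        byte[] input = {'2', '0', '1', '5', (byte) 0xC4, (byte) 0xEA, ' ', '1',
                        (byte) 0xD4, (byte) 0xC2, ' ', '1', (byte) 0xC8, (byte) 0xD5};
        System.out.println(show(toIso2022Cn(input)));
    }
}
```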
Out of those 3 Encodings, GB2312 Encoding is the most commonly used.
Now we have learned that a character in the GB2312 character set can be identified or represented in 3 ways: by its Location Code, by its Native Code, or by its GB2312 Encoding.
A list of all GB2312 characters with their Location Codes and GB2312 Encodings will be provided later in this book.
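The relationship between these codes can be sketched in Java. The sketch assumes the usual GB2312 layout, in which a Location Code is a row and a column number from 1 to 94 each, and the Native Code is obtained by adding 0x20 to both. That offset is not stated in this section, and the class and method names are hypothetical.

```java
// Sketch only: relates Location Code, Native Code, and GB2312 Encoding.
final class Gb2312CodesSketch {

    // Location Code (row, column in 1..94) to Native Code: add 0x20 to each part.
    static int locationToNative(int row, int col) {
        if (row < 1 || row > 94 || col < 1 || col > 94) {
            throw new IllegalArgumentException("Row and column must be in 1..94");
        }
        return ((row + 0x20) << 8) | (col + 0x20);
    }

    // Native Code to GB2312 Encoding: add 0x80 to each byte.
    static int locationToEncoding(int row, int col) {
        return locationToNative(row, col) + 0x8080;
    }

    public static void main(String[] args) {
        // Row 16, column 1
        System.out.printf("%02d%02d 0x%04X 0x%04X%n", 16, 1,
                locationToNative(16, 1), locationToEncoding(16, 1));
        // prints 1601 0x3021 0xB0A1
    }
}
```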