GB2312 Tutorials - Herong's Tutorial Examples - v4.04, by Herong Yang
GB2312Unicode.java - GB2312 to Unicode Mapping
GB2312Unicode.java is a Java program that generates a table mapping every GB2312 character from its GB2312 code to its Unicode code.
If we compare the GB2312 codes and the Unicode codes of the same Chinese characters, we will not find any mathematical relationship between them. So anyone who wants to convert a Chinese text file from the GB2312 encoding to a Unicode encoding needs a big mapping table that covers all 7,445 GB2312 characters.
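The lack of a formula can be seen with a minimal sketch like the one below. It assumes the JDK's "GBK" charset is available (the same charset the program later in this section uses); the class name GBLookup and the method toUnicode are hypothetical names chosen for illustration. It decodes a few consecutive GB2312 codes and prints the Unicode code point of each. For example, B0A1 decodes to U+554A while its neighbor B0A2 decodes to U+963F, so consecutive GB2312 codes do not land on consecutive Unicode codes.

```java
import java.nio.charset.Charset;

class GBLookup {
    // Decodes one two-byte GB2312 code (for example 0xB0A1) through the
    // JDK's GBK charset and returns the Unicode code point it maps to.
    static int toUnicode(int gbCode) {
        byte[] bytes = { (byte) (gbCode >> 8), (byte) gbCode };
        String s = new String(bytes, Charset.forName("GBK"));
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        // Four consecutive GB2312 codes from the start of row 16
        for (int gb = 0xB0A1; gb <= 0xB0A4; gb++) {
            System.out.printf("GB %04X -> U+%04X%n", gb, toUnicode(gb));
        }
    }
}
```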
If we search the Internet, we can probably find copies of such a mapping table in different formats.
But if you have the JDK (Java Development Kit) installed on your computer, you can build a GB2312 to Unicode mapping table yourself with a simple program.
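Here is a Java program I wrote to build a GB2312 to Unicode mapping table, GB2312Unicode.java. The output of the program includes 5 columns per character: the Q.W. (Qu-Wei, or row and column) number, the character itself as GB2312 bytes, the GB2312 code in hexadecimal, the Unicode (UTF-16BE) code in hexadecimal, and the UTF-8 code in hexadecimal.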
/* GB2312Unicode.java - Copyright (c) 2015, HerongYang.com, All Rights Reserved. */
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class GB2312Unicode {
   static OutputStream out = null;
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   static int b_out[] = {201,267,279,293,484,587,625,657,734,782,827,
      874,901,980,5590};
   static int e_out[] = {216,268,280,294,494,594,632,694,748,794,836,
      894,903,994,5594};
   public static void main(String[] args) {
      try {
         out = new FileOutputStream("gb2312_unicode.gb");
         writeCode();
         out.close();
      } catch (IOException e) {
         System.out.println(e.toString());
      }
   }
   public static void writeCode() throws IOException {
      boolean reserved = false;
      String name = null;
      // GB2312 is not supported by JDK. So I am using GBK.
      CharsetDecoder gbdc = Charset.forName("GBK").newDecoder();
      CharsetEncoder uxec = Charset.forName("UTF-16BE").newEncoder();
      CharsetEncoder u8ec = Charset.forName("UTF-8").newEncoder();
      ByteBuffer gbbb = null;
      ByteBuffer uxbb = null;
      ByteBuffer u8bb = null;
      CharBuffer cb = null;
      int count = 0;
      for (int i=1; i<=94; i++) {
         // Defining row settings
         if (i>=1 && i<=9) {
            reserved = false;
            name = "Graphic symbols";
         } else if (i>=10 && i<=15) {
            reserved = true;
            name = "Reserved";
         } else if (i>=16 && i<=55) {
            reserved = false;
            name = "Level 1 characters";
         } else if (i>=56 && i<=87) {
            reserved = false;
            name = "Level 2 characters";
         } else if (i>=88 && i<=94) {
            reserved = true;
            name = "Reserved";
         }
         // writing row title
         writeln();
         writeString("<p>");
         writeNumber(i);
         writeString(" Row: "+name);
         writeln();
         writeString("</p>");
         writeln();
         if (!reserved) {
            writeln();
            writeHeader();
            // looping through all characters in one row
            for (int j=1; j<=94; j++) {
               byte hi = (byte)(0xA0 + i);
               byte lo = (byte)(0xA0 + j);
               if (validGB(i,j)) {
                  // getting GB, UTF-16BE, UTF-8 codes
                  gbbb = ByteBuffer.wrap(new byte[]{hi,lo});
                  try {
                     cb = gbdc.decode(gbbb);
                     uxbb = uxec.encode(cb);
                     cb.rewind();
                     u8bb = u8ec.encode(cb);
                  } catch (CharacterCodingException e) {
                     cb = null;
                     uxbb = null;
                     u8bb = null;
                  }
               } else {
                  cb = null;
                  uxbb = null;
                  u8bb = null;
               }
               writeNumber(i);
               writeNumber(j);
               writeString(" ");
               if (cb!=null) {
                  writeByte(hi);
                  writeByte(lo);
                  writeString(" ");
                  writeHex(hi);
                  writeHex(lo);
                  count++;
               } else {
                  writeGBSpace();
                  writeString(" null");
               }
               writeString(" ");
               writeByteBuffer(uxbb,2);
               writeString(" ");
               writeByteBuffer(u8bb,3);
               if (j%2 == 0) {
                  writeln();
               } else {
                  writeString(" ");
               }
            }
            writeFooter();
         }
      }
      System.out.println("Number of GB characters wrote: "+count);
   }
   public static void writeln() throws IOException {
      out.write(0x0D);
      out.write(0x0A);
   }
   public static void writeByte(byte b) throws IOException {
      out.write(b & 0xFF);
   }
   public static void writeByteBuffer(ByteBuffer b, int l)
      throws IOException {
      int i = 0;
      if (b==null) {
         writeString("null");
         i = 2;
      } else {
         for (i=0; i<b.limit(); i++) writeHex(b.get(i));
      }
      for (int j=i; j<l; j++) writeString(" ");
   }
   public static void writeGBSpace() throws IOException {
      out.write(0xA1);
      out.write(0xA1);
   }
   public static void writeString(String s) throws IOException {
      if (s!=null) {
         for (int i=0; i<s.length(); i++) {
            out.write((int) (s.charAt(i) & 0xFF));
         }
      }
   }
   public static void writeNumber(int i) throws IOException {
      String s = "00" + String.valueOf(i);
      writeString(s.substring(s.length()-2,s.length()));
   }
   public static void writeHex(byte b) throws IOException {
      out.write((int) hexDigit[(b >> 4) & 0x0F]);
      out.write((int) hexDigit[b & 0x0F]);
   }
   public static void writeHeader() throws IOException {
      writeString("<pre>");
      writeln();
      writeString("Q.W. ");
      writeGBSpace();
      writeString(" GB Uni. UTF-8 ");
      writeString(" ");
      writeString("Q.W. ");
      writeGBSpace();
      writeString(" GB Uni. UTF-8 ");
      writeln();
      writeln();
   }
   public static void writeFooter() throws IOException {
      writeString("</pre>");
      writeln();
   }
   public static boolean validGB(int i,int j) {
      for (int l=0; l<b_out.length; l++) {
         if (i*100+j>=b_out[l] && i*100+j<=e_out[l]) return false;
      }
      return true;
   }
}
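The program above relies on two small pieces of arithmetic: each GB2312 byte is 0xA0 plus the row or column number, and the b_out/e_out tables store unassigned positions as row*100+column. The following is a minimal sketch of that arithmetic in isolation, not part of the original program; the class name QuWeiDemo and the methods toGBCode and toQuWei are hypothetical names used only for illustration.

```java
class QuWeiDemo {
    // Packs the two GB2312 bytes for a row/column (Q.W.) position into one int.
    // Each byte is 0xA0 plus the row or column number, as in the program above.
    static int toGBCode(int row, int col) {
        if (row < 1 || row > 94 || col < 1 || col > 94) {
            throw new IllegalArgumentException("row and column must be 1..94");
        }
        return ((0xA0 + row) << 8) | (0xA0 + col);
    }

    // Inverse direction: recovers the row*100+column number used by the
    // b_out/e_out tables from a packed two-byte GB2312 code.
    static int toQuWei(int gbCode) {
        int row = (gbCode >> 8) - 0xA0;
        int col = (gbCode & 0xFF) - 0xA0;
        return row * 100 + col;
    }

    public static void main(String[] args) {
        // Row 16, column 1 is the first Level 1 character position
        System.out.printf("Q.W. 1601 -> GB %04X%n", toGBCode(16, 1));
        System.out.printf("GB B0A1 -> Q.W. %04d%n", toQuWei(0xB0A1));
    }
}
```

With this encoding, an entry pair such as 5590 and 5594 in b_out and e_out reads as "row 55, columns 90 through 94 are not assigned".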
The entire output of this program is included later in the book.
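To get a feel for a single entry without running the whole program, here is a rough sketch that builds one table row as a Java string. It is an illustration under assumptions, not the author's code: the class name RowDemo and the methods hex and row are hypothetical, it returns a String instead of writing raw GB bytes to a file, and it separates columns with single spaces.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RowDemo {
    // Formats a byte array as upper-case hexadecimal digits.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Builds one row: Q.W. number, character, GB code, UTF-16BE code, UTF-8 code.
    // Assumes the position (qu, wei) is an assigned GB2312 character.
    static String row(int qu, int wei) {
        byte[] gb = { (byte) (0xA0 + qu), (byte) (0xA0 + wei) };
        String ch = new String(gb, Charset.forName("GBK"));
        return String.format("%02d%02d %s %s %s %s", qu, wei, ch, hex(gb),
            hex(ch.getBytes(StandardCharsets.UTF_16BE)),
            hex(ch.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        // Row 16, column 1; how the character displays depends on the console encoding
        System.out.println(row(16, 1));
    }
}
```

For row 16, column 1, the hexadecimal columns come out as B0A1, 554A and E5958A.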