Big5Unicode.java - Big5 to Unicode Mapping

Big5Unicode.java is a Java program that generates a table mapping every Big5 character from its Big5 code to its Unicode code.

If we compare the Big5 codes with the Unicode codes of the same Chinese characters, we find no mathematical relation between them. So anyone who wants to convert a Chinese text file from the Big5 encoding to a Unicode encoding needs a large mapping table that covers all 13,461 Big5 characters.
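To make the conversion concrete, here is a minimal sketch that relies on the mapping table built into the JDK instead of a hand-made one. The class name Big5ToUtf8Sketch and the method big5ToUtf8 are hypothetical names chosen for illustration, and the sketch assumes the "Big5" charset is available in your Java runtime, which is the case for standard JDK builds.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Big5ToUtf8Sketch {
  // Decodes Big5 bytes into a Java String (UTF-16 internally),
  // then re-encodes that String as UTF-8 bytes.
  static byte[] big5ToUtf8(byte[] big5Bytes) {
    String text = new String(big5Bytes, Charset.forName("Big5"));
    return text.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // 0xA440 is the first code in the Level 1 Characters block.
    byte[] big5 = {(byte) 0xA4, (byte) 0x40};

    String text = new String(big5, Charset.forName("Big5"));
    System.out.printf("Big5 A440 -> U+%04X%n", (int) text.charAt(0));

    System.out.print("UTF-8 bytes:");
    for (byte b : big5ToUtf8(big5)) {
      System.out.printf(" %02X", b & 0xFF);
    }
    System.out.println();
  }
}
```

The Big5 code A440 and the resulting Unicode code share no arithmetic relation; the lookup is done entirely by the table inside the charset decoder.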

If we search the Internet, we can probably find copies of such a mapping table in different formats.

But if you have a JDK (Java Development Kit) installed on your computer, you can build a Big5 to Unicode mapping table yourself with a simple program.

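Here is a Java program I wrote to build a Big5 to Unicode mapping table, Big5Unicode.java. The output of the program includes 3 columns per character: the character itself written as raw Big5 bytes, its Big5 code in hexadecimal, and its Unicode code in hexadecimal (encoded as UTF-16BE).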

/* Big5Unicode.java
 - Copyright (c) 2015, HerongYang.com, All Rights Reserved.
 */
import java.io.*;
import java.nio.*;
import java.nio.charset.*;

class Big5Unicode {
  static OutputStream out = null;
  static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                           '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
  static String blk_name[] = {"Special Symbols",
       "Level 1 Characters", "Level 2 Characters"};
  static int blk_first[] = {0xA140, 0xA440, 0xC940};
  static int blk_last[] = {0xA3BF, 0xC67E, 0xF9D5};
  static int blk_size[] = {408, 5401, 7652};
  static int blk_count[] = {0, 0, 0};

  public static void main(String[] args) {
    try {
      out = new FileOutputStream("big5-unicode.big5");
      writeCode();
      out.close();
    } catch (IOException e) {
      System.out.println(e.toString());
    }
  }

  public static void writeCode() throws IOException {
    String name = null;
    CharsetDecoder b5dc = Charset.forName("Big5").newDecoder();
    CharsetEncoder uxec = Charset.forName("UTF-16BE").newEncoder();
    ByteBuffer b5bb = null;
    ByteBuffer uxbb = null;
    CharBuffer cb = null;

    for (int i=0xA1; i<=0xFF; i++) {
      int blk = getBlock(i);
      if (blk==-1) continue;

      name = blk_name[blk];
      writeln();
      writeString("<p><b>Row ");
      writeHex((byte)i);
      writeString(": "+name+"</b></p>");
      writeln();

      writeln();
      writeHeader();
      for (int j=0x40; j<=0xFF; j++) {
        byte hi = (byte)(i);
        byte lo = (byte)(j);

        if (validBig5(i, j, blk)) {
          b5bb = ByteBuffer.wrap(new byte[]{hi, lo});
          try {
            cb = b5dc.decode(b5bb);
            uxbb = uxec.encode(cb);
            writeByte(hi);
            writeByte(lo);
            writeString(" ");
            writeHex(hi);
            writeHex(lo);
            blk_count[blk] = blk_count[blk] + 1;

          } catch (CharacterCodingException e) {
            cb = null;
            uxbb = null;
            writeBig5Space();
            writeString(" fail");
          }
        } else {
          cb = null;
          uxbb = null;
          writeBig5Space();
          writeString(" null");
        }

        writeString(" ");
        writeByteBuffer(uxbb, 2);

        if ((j+1)%4 == 0) {
          writeln();
        } else {
          writeString("   ");
        }
      }
      writeFooter();
    }

    for (int l=0; l<blk_name.length; l++) {
      System.out.println(blk_name[l]+": "
        + blk_count[l]+" of "+blk_size[l]);
    }
  }

  public static void writeln() throws IOException {
    out.write(0x0D);
    out.write(0x0A);
  }

  public static void writeByte(byte b) throws IOException {
    out.write(b & 0xFF);
  }

  public static void writeByteBuffer(ByteBuffer b, int l)
    throws IOException {
    int i = 0;
    if (b==null) {
      writeString("null");
      i = 2;
    } else {
      for (i=0; i<b.limit(); i++) writeHex(b.get(i));
    }
    for (int j=i; j<l; j++) writeString("  ");
  }

  public static void writeBig5Space() throws IOException {
    out.write(0xA1);
    out.write(0x40);
  }

  public static void writeString(String s) throws IOException {
    if (s!=null) {
      for (int i=0; i<s.length(); i++) {
        out.write((int) (s.charAt(i) & 0xFF));
       }
    }
  }

  public static void writeNumber(int i) throws IOException {
    String s = "00" + String.valueOf(i);
    writeString(s.substring(s.length()-2,s.length()));
  }

  public static void writeHex(byte b) throws IOException {
    out.write((int) hexDigit[(b >> 4) & 0x0F]);
    out.write((int) hexDigit[b & 0x0F]);
  }

  public static void writeHeader() throws IOException {
    writeString("<pre class=\"chinese\">");
    writeBig5Space();
    writeString(" Big5 Uni.");
    writeString("   ");
    writeBig5Space();
    writeString(" Big5 Uni.");
    writeString("   ");
    writeBig5Space();
    writeString(" Big5 Uni.");
    writeString("   ");
    writeBig5Space();
    writeString(" Big5 Uni.");
    writeln();
    writeln();
  }

  public static void writeFooter() throws IOException {
    writeString("</pre>");
    writeln();
  }

  public static boolean validBig5(int i, int j, int blk) {
    // valid ranges for j: 0x40 - 0x7E and 0xA1 - 0xFE.
    if (j<0x40) return false;
    if (j>0x7E && j<0xA1) return false;
    if (j>0xFE) return false;

    int last_i = blk_last[blk] >> 8;
    int last_j = blk_last[blk] & 0xFF;
    if (i==last_i && j>last_j) return false;

    return true;
  }

  public static int getBlock(int i) {
    for (int l=0; l<blk_first.length; l++) {
      int first = blk_first[l] >> 8;
      int last = blk_last[l] >> 8;
      if (i>=first && i<=last) return l;
    }
    return -1;
  }
}

Notes on the Java source code:

You can compile and run this Java program with any JDK version from JDK 8 to JDK 20. Here is the execution output:

herong$ javac Big5Unicode.java
herong$ java Big5Unicode

Special Symbols: 406 of 408
Level 1 Characters: 5401 of 5401
Level 2 Characters: 7652 of 7652

As you can see from the output, the JDK failed to decode 2 Big5 codes in the Special Symbols block: 0xA1C3 (￣) and 0xA1C5 (ˍ). So I need to map them manually as:

￣ A1C3 U+FFE3
ˍ A1C5 U+02CD

JDK also mapped 3 Big5 code points incorrectly. I need to fix them manually.

A1FE: Java bug - wrong mapping (╱ A1FE U+2571, ╱ A2AC U+2571)
   It should be: ／ A1FE U+FF0F
A240: Java bug - wrong mapping (╲ A240 U+2572, ╲ A2AD U+2572)
   It should be: ＼ A240 U+FF3C
A15A: Java bug - wrong mapping (＿ A15A U+FF3F, ＿ A1C4 U+FF3F)
   It should be: ╴ A15A U+2574
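One way to apply these manual corrections is to check a small override table before falling back to the JDK decoder. The following is only a sketch under that assumption, not the method used to produce the tables in this book; the class name Big5Overrides, the map FIXES, and the method toUnicode are hypothetical names introduced for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

class Big5Overrides {
  // Manual corrections listed above: Big5 code -> Unicode code.
  static final Map<Integer, Integer> FIXES = new HashMap<>();
  static {
    FIXES.put(0xA1C3, 0xFFE3); // not decoded by the JDK
    FIXES.put(0xA1C5, 0x02CD); // not decoded by the JDK
    FIXES.put(0xA1FE, 0xFF0F); // JDK gives U+2571
    FIXES.put(0xA240, 0xFF3C); // JDK gives U+2572
    FIXES.put(0xA15A, 0x2574); // JDK gives U+FF3F
  }

  // Returns the Unicode code for a 2-byte Big5 code,
  // or -1 if the code is neither overridden nor decodable.
  static int toUnicode(int big5) {
    Integer fixed = FIXES.get(big5);
    if (fixed != null) return fixed;

    byte[] bytes = {(byte) (big5 >> 8), (byte) big5};
    try {
      CharBuffer cb = Charset.forName("Big5").newDecoder()
          .decode(ByteBuffer.wrap(bytes));
      return cb.length() > 0 ? Character.codePointAt(cb, 0) : -1;
    } catch (CharacterCodingException e) {
      return -1;
    }
  }

  public static void main(String[] args) {
    int[] samples = {0xA1C3, 0xA1C5, 0xA1FE, 0xA240, 0xA15A, 0xA440};
    for (int code : samples) {
      System.out.printf("%04X U+%04X%n", code, toUnicode(code));
    }
  }
}
```

Keeping the corrections in a separate map leaves the generated table reproducible and makes each manual change easy to review.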

The entire output of this program is included later in the book.
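As a sanity check on the expected totals stored in blk_size, the block sizes can be recomputed from the first and last code of each block and the valid second-byte ranges (0x40 - 0x7E and 0xA1 - 0xFE) used in validBig5(). This is a small standalone sketch; the class name Big5BlockSizes and the methods validTrail and countBlock are hypothetical names, not part of Big5Unicode.java.

```java
class Big5BlockSizes {
  // True when j is a valid Big5 second byte: 0x40-0x7E or 0xA1-0xFE.
  static boolean validTrail(int j) {
    return (j >= 0x40 && j <= 0x7E) || (j >= 0xA1 && j <= 0xFE);
  }

  // Counts the 2-byte codes from first to last (inclusive)
  // whose second byte falls in a valid range.
  static int countBlock(int first, int last) {
    int count = 0;
    for (int code = first; code <= last; code++) {
      if (validTrail(code & 0xFF)) count++;
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(countBlock(0xA140, 0xA3BF)); // prints 408
    System.out.println(countBlock(0xA440, 0xC67E)); // prints 5401
    System.out.println(countBlock(0xC940, 0xF9D5)); // prints 7652
  }
}
```

Each full row contributes 63 + 94 = 157 codes, and the three totals add up to the 13,461 characters mentioned earlier.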
