Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

JDK - Encoding Conversion

Notes and sample code below are based on J2SDK 1.4.1_01.

Unicode Data Entry

Encoding conversion is about reading characters stored in a file encoded with encoding A, and writing them into another file encoded with encoding B.
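The definition above can be sketched in a few lines of Java. This is only a minimal illustration, not the conversion program of these notes: the class name ConvertSketch and the method name convert are made up for this example, and the charset names are assumed to be ones supported by your JDK.

```java
import java.io.*;

class ConvertSketch {
   // Hypothetical helper: decodes bytes from "in" using fromCharset and
   // writes the same characters to "out" encoded with toCharset.
   static void convert(InputStream in, String fromCharset,
         OutputStream out, String toCharset) throws IOException {
      Reader reader = new InputStreamReader(in, fromCharset);
      Writer writer = new OutputStreamWriter(out, toCharset);
      char[] buffer = new char[1024];
      int n;
      while ((n = reader.read(buffer)) != -1) {
         writer.write(buffer, 0, n);
      }
      writer.flush();
   }

   public static void main(String[] a) throws IOException {
      if (a.length < 4) {
         System.out.println(
            "Usage: java ConvertSketch inFile inCharset outFile outCharset");
         return;
      }
      InputStream in = new FileInputStream(a[0]);
      OutputStream out = new FileOutputStream(a[2]);
      try {
         convert(in, a[1], out, a[3]);
      } finally {
         in.close();
         out.close();
      }
   }
}
```

The characters themselves never change; only the byte sequences used to store them differ between the input file and the output file.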

Before going into the details of encoding conversion, let's talk briefly about Unicode data entry. How do we enter Unicode characters into a file? There are several ways to do that:

  • Using an encoding-specific word processor. Usually, one word processor will allow you to enter characters of one particular language.
  • Using a Hex editor to directly enter the byte sequences that represent the desired characters in a specific encoding.
  • Using a Unicode-based programming language to enter the desired characters as string literals.

Word processors are too specific to be discussed here.

Hex editors are the ultimate data entry tools for Unicode characters, because they let you enter any byte sequence directly. They can also be used to inspect and repair encoded text files. But Hex editors are very hard to use.

Notepad on Windows is not a Hex editor.

UltraEdit on Windows is a Hex editor.
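If no Hex editor is at hand, the inspection part can be approximated with a small Java program that prints the bytes of a file as hex values. This is only a rough sketch, assuming Java 5 or later for StringBuilder; the class name HexDumpSketch and the method name toHex are made up for this example.

```java
import java.io.*;

class HexDumpSketch {
   static final String DIGITS = "0123456789ABCDEF";

   // Hypothetical helper: formats the first "length" bytes as two-digit
   // uppercase hex values separated by single spaces.
   static String toHex(byte[] bytes, int length) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < length; i++) {
         if (i > 0) sb.append(' ');
         int b = bytes[i] & 0xFF;   // treat the byte as unsigned
         sb.append(DIGITS.charAt(b >> 4));
         sb.append(DIGITS.charAt(b & 0x0F));
      }
      return sb.toString();
   }

   public static void main(String[] a) throws IOException {
      if (a.length < 1) {
         System.out.println("Usage: java HexDumpSketch file");
         return;
      }
      InputStream in = new FileInputStream(a[0]);
      try {
         byte[] row = new byte[16];
         int n;
         // Prints up to 16 bytes per row; a row can be shorter if the
         // stream returns fewer bytes in one read.
         while ((n = in.read(row)) != -1) {
            System.out.println(toHex(row, n));
         }
      } finally {
         in.close();
      }
   }
}
```

Unlike a real Hex editor, this sketch can only display bytes; it cannot edit them.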

Using a Unicode-based programming language, like Java, to enter Unicode characters into a file is very interesting. Here is a sample program, UnicodeHello.java:

/**
 * UnicodeHello.java
 * Copyright (c) 2002 by Dr. Herong Yang
 *
 * This program is a simple tool to allow you to enter several lines of
 * text, and write them into a file with the specified encoding
 * (charset name). The input text lines use Java string conventions,
 * which allow you to enter ASCII characters directly, and any
 * non-ASCII characters with escape sequences.
 *
 * This version of the program writes out the "Hello computer!" message
 * in several languages.
 */
import java.io.*;
class UnicodeHello {
   public static void main(String[] a) {
      // The following Array contains text to be saved into the output
      // File. To enter your own text, just replace this Array.
      String[] text = {
"Hello computer! - English", // ASCII
"\u7535\u8111\u4F60\u597D\uFF01 - Simplified Chinese", // GB2312
"\u96FB\u8166\u4F60\u597D\uFE57 - Traditional Chinese" // Big5
      };
      String outFile = "hello.utf-16be";
      if (a.length>0) outFile = a[0];
      String outCharsetName = "utf-16be";
      if (a.length>1) outCharsetName = a[1];
      String crlf = System.getProperty("line.separator");
      try {
         OutputStreamWriter out = new OutputStreamWriter(
            new FileOutputStream(outFile), outCharsetName);
         for (int i=0; i<text.length; i++) {
            out.write(text[i]);
            out.write(crlf);
         }
         out.close();
      } catch (IOException e) {
         System.out.println(e.toString());
      }
   }
}

As you can see from the source code, this program will write the "Hello computer!" message in several languages. Let's compile this program and run it to get the characters saved into a file with UTF-16BE encoding:

javac UnicodeHello.java
java UnicodeHello hello.utf-16be utf-16be
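One way to check what was written is to read the file back with the same charset name and print every non-ASCII character as a \uXXXX escape, so the output can be compared with the string literals in UnicodeHello.java. The program below is only a sketch, assuming Java 5 or later for StringBuilder; the class name ReadBackSketch and the method name escape are made up for this example.

```java
import java.io.*;

class ReadBackSketch {
   // Hypothetical helper: keeps ASCII characters as they are and turns
   // every other char into a backslash, the letter u and 4 hex digits.
   static String escape(String s) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < s.length(); i++) {
         char c = s.charAt(i);
         if (c < 0x80) {
            sb.append(c);
         } else {
            String hex = Integer.toHexString(c).toUpperCase();
            sb.append("\\u");
            for (int k = hex.length(); k < 4; k++) sb.append('0');
            sb.append(hex);
         }
      }
      return sb.toString();
   }

   public static void main(String[] a) throws IOException {
      String inFile = "hello.utf-16be";
      if (a.length > 0) inFile = a[0];
      String inCharsetName = "utf-16be";
      if (a.length > 1) inCharsetName = a[1];
      BufferedReader in = new BufferedReader(new InputStreamReader(
         new FileInputStream(inFile), inCharsetName));
      try {
         String line;
         while ((line = in.readLine()) != null) {
            System.out.println(escape(line));
         }
      } finally {
         in.close();
      }
   }
}
```

If the file is read with a charset name different from the one used to write it, the escapes printed will usually not match the original string literals, which makes encoding mistakes easy to spot.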

Dr. Herong Yang, updated in 2007