EncodingConverter.java - Encoding Conversion Sample Program

This section provides a tutorial example on how to write a sample program, EncodingConverter.java, to convert text files from one character set encoding to another.

With the help of HexWriter.java, I know that file hello.utf-16be stores strings of characters in UTF-16BE encoding.

Now I want to write a sample program, EncodingConverter.java, that reads a text file in one character set encoding and writes it out in another:

/* EncodingConverter.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 *
 * This program allows you to convert a text file in one encoding
 * to another file in a different encoding.
 */
import java.io.*;
class EncodingConverter {
   public static void main(String[] a) {
      // All four arguments are required
      if (a.length < 4) {
         System.out.println("Usage: java EncodingConverter"
            +" inFile inCharsetName outFile outCharsetName");
         return;
      }
      String inFile = a[0];
      String inCharsetName = a[1];
      String outFile = a[2];
      String outCharsetName = a[3];
      int n = 0;
      // try-with-resources closes both streams even if an error occurs
      try (InputStreamReader in = new InputStreamReader(
              new FileInputStream(inFile), inCharsetName);
           OutputStreamWriter out = new OutputStreamWriter(
              new FileOutputStream(outFile), outCharsetName)) {
         // Decode one character at a time and re-encode it
         int c = in.read();
         while (c!=-1) {
            out.write(c);
            n++;
            c = in.read();
         }
      } catch (IOException e) {
         System.out.println(e.toString());
         return;
      }
      // Both streams are closed here, so the file lengths are final
      System.out.println("Number of characters: "+n);
      System.out.println("Number of input bytes: "
         +(new File(inFile)).length());
      System.out.println("Number of output bytes: "
         +(new File(outFile)).length());
   }
}

Compile this program and use it to convert our hello message file into several encodings:

C:\herong>javac EncodingConverter.java

C:\herong>java EncodingConverter hello.utf-16be utf-16be \
   hello.ascii ascii
Number of characters: 84
Number of input bytes: 168
Number of output bytes: 84

C:\herong>java EncodingConverter hello.utf-16be utf-16be \
   hello.iso-8859-1 iso-8859-1
Number of characters: 84
Number of input bytes: 168
Number of output bytes: 84

C:\herong>java EncodingConverter hello.utf-16be utf-16be \
   hello.utf-8 utf-8
Number of characters: 84
Number of input bytes: 168
Number of output bytes: 104

C:\herong>java EncodingConverter hello.utf-16be utf-16be \
   hello.gbk gbk
Number of characters: 84
Number of input bytes: 168
Number of output bytes: 94

C:\herong>java EncodingConverter hello.utf-16be utf-16be \
   hello.big5 big5
Number of characters: 84
Number of input bytes: 168
Number of output bytes: 92

C:\herong>java EncodingConverter hello.utf-16be utf-16be \
   hello.shift_jis shift_jis
Number of characters: 84
Number of input bytes: 168
Number of output bytes: 89

By reviewing the output files, you should see that:

hello.ascii - In this file, only the English message is good, because it contains only ASCII characters. The Simplified Chinese and Traditional Chinese messages are both damaged. Each of their characters is replaced by 0x3F, the ASCII question mark (?), which the encoder writes for any character it cannot map.

hello.iso-8859-1 - This file is identical to hello.ascii, because the messages contain no characters in the 0x80 - 0xFF range, the only range that ISO-8859-1 adds to ASCII.
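The 0x3F substitution can be reproduced without any files. The following is a minimal sketch, not part of the original program: the class name ReplacementDemo and the sample string are made up for illustration. It relies on String.getBytes(Charset), which replaces unmappable characters with the charset's default replacement bytes, the same behavior the converter gets from OutputStreamWriter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original tutorial
class ReplacementDemo {
   // Formats bytes as space-separated upper-case hex pairs
   static String toHex(byte[] bytes) {
      StringBuilder sb = new StringBuilder();
      for (byte b : bytes) {
         if (sb.length() > 0) sb.append(' ');
         sb.append(String.format("%02X", b & 0xFF));
      }
      return sb.toString();
   }

   public static void main(String[] a) {
      // "Hi" followed by two Chinese characters (U+4F60 U+597D)
      String text = "Hi\u4F60\u597D";
      byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
      // The two Chinese characters cannot be mapped to ASCII
      System.out.println(toHex(ascii)); // prints 48 69 3F 3F
   }
}
```

Each unmappable character becomes a single 0x3F byte, which is why hello.ascii has exactly one byte per input character.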

hello.utf-8 - This file contains all messages with no damage. The ASCII characters are stored as one-byte characters as expected.
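The byte counts reported by the converter can be checked on a small string. This is a rough sketch under stated assumptions: the class name ByteCountDemo, the helper encodedLength, and the sample string are hypothetical and do not come from the tutorial. It shows that UTF-16BE uses two bytes for every character here, while UTF-8 uses one byte for ASCII characters and three bytes for these Chinese characters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original tutorial
class ByteCountDemo {
   // Returns the number of bytes needed to encode text in the given charset
   static int encodedLength(String text, Charset cs) {
      return text.getBytes(cs).length;
   }

   public static void main(String[] a) {
      // 6 ASCII characters followed by 2 Chinese characters (U+4E16 U+754C)
      String text = "Hello \u4E16\u754C";
      // 8 characters x 2 bytes
      System.out.println(encodedLength(text, StandardCharsets.UTF_16BE)); // prints 16
      // 6 x 1 byte + 2 x 3 bytes
      System.out.println(encodedLength(text, StandardCharsets.UTF_8));    // prints 12
   }
}
```

The same arithmetic is consistent with the 104-byte result above: assuming every Chinese character in the file takes three bytes in UTF-8, 74 ASCII characters plus 10 Chinese characters give 74 + 30 = 104 bytes.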

hello.gbk - In this file, the Simplified Chinese message is good. In fact, characters in the Simplified Chinese message are stored as code values in the GBK character set standard. The English message is also good, because GBK is backward compatible with ASCII. We are lucky with the Traditional Chinese message, because the Big5 characters used in the message are also valid in the GBK standard. If you use some Big5 special characters, the result could be different.

hello.big5 - In this file, the Traditional Chinese message is good. In fact, characters in the Traditional Chinese message are stored as code values in the Big5 character set standard. The English message is also good, because Big5 is backward compatible with ASCII. We are not lucky with the Simplified Chinese message: two GB characters used in the message are not valid in the Big5 standard, so 0x3F was stored for each of them.

hello.shift_jis - In this file, the English message is still good. Some of the characters from both the Simplified and Traditional Chinese messages are invalid and are replaced by 0x3F placeholders. Other Chinese characters are still valid in the Shift_JIS character set. This is not so surprising, because Chinese and Japanese share many characters.
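To find out which characters a target encoding cannot represent before converting a file, one option is CharsetEncoder.canEncode(). The sketch below is an illustration only: the class name UnmappableFinder and the sample string are hypothetical, and US-ASCII is used as the target because it is available on every Java platform. A name such as "Big5" can be passed instead where that charset is installed.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical demo class, not part of the original tutorial
class UnmappableFinder {
   // Returns the characters of text that the named charset cannot encode.
   // Checks one char at a time, so it is only meaningful for characters
   // in the Basic Multilingual Plane (no surrogate pairs).
   static String unmappable(String text, String charsetName) {
      CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < text.length(); i++) {
         char c = text.charAt(i);
         if (!encoder.canEncode(c)) sb.append(c);
      }
      return sb.toString();
   }

   public static void main(String[] a) {
      // "Hi" followed by two Chinese characters (U+4F60 U+597D)
      String text = "Hi\u4F60\u597D";
      for (char c : unmappable(text, "US-ASCII").toCharArray()) {
         // Print the code point of each character that would become 0x3F
         System.out.println(String.format("U+%04X", (int) c));
      }
   }
}
```

For the sample string, the two Chinese characters are reported and the two ASCII letters are not.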
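None of the conversions above reports that characters were lost, because a writer created with a charset name silently substitutes 0x3F. If a lossy conversion should fail instead, a CharsetEncoder configured with CodingErrorAction.REPORT can be used. The following is a sketch only, with a hypothetical class name StrictEncodeDemo and helper encodeStrict that are not part of the original program.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical demo class, not part of the original tutorial
class StrictEncodeDemo {
   // Encodes text into the named charset. Throws CharacterCodingException
   // instead of substituting 0x3F when a character cannot be mapped.
   static byte[] encodeStrict(String text, String charsetName)
         throws CharacterCodingException {
      CharsetEncoder encoder = Charset.forName(charsetName).newEncoder()
         .onMalformedInput(CodingErrorAction.REPORT)
         .onUnmappableCharacter(CodingErrorAction.REPORT);
      ByteBuffer buf = encoder.encode(CharBuffer.wrap(text));
      byte[] bytes = new byte[buf.remaining()];
      buf.get(bytes);
      return bytes;
   }

   public static void main(String[] a) {
      try {
         // Two Chinese characters that US-ASCII cannot represent
         encodeStrict("Hi\u4F60\u597D", "US-ASCII");
         System.out.println("Converted without loss");
      } catch (CharacterCodingException e) {
         System.out.println("Lossy conversion: " + e);
      }
   }
}
```

An encoder configured this way can also be passed to the OutputStreamWriter(OutputStream, CharsetEncoder) constructor, so that a file conversion stops with an IOException at the first unmappable character.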
