JDK - Encoding Conversion
(Continued from previous part...)
Compile this program and use it to convert our hello message file into several
encodings:
javac EncodingConverter.java
java EncodingConverter hello.utf-16be utf-16be hello.ascii ascii
java EncodingConverter hello.utf-16be utf-16be hello.iso-8859-1 iso-...
java EncodingConverter hello.utf-16be utf-16be hello.utf-8 utf-8
java EncodingConverter hello.utf-16be utf-16be hello.gbk gbk
java EncodingConverter hello.utf-16be utf-16be hello.big5 big5
java EncodingConverter hello.utf-16be utf-16be hello.shift_jis shift_jis
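If you do not have the previous part at hand, the conversion step itself can be sketched as below. This is only a minimal sketch, assuming Java 7 or later; the class name ConvertSketch and its convert() method are made up for illustration and are not the EncodingConverter program used in the commands above. It takes the same four arguments: input file, input encoding, output file, output encoding.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

// Hypothetical stand-in for a converter, not the author's EncodingConverter.
class ConvertSketch {
    // Decodes inFile with charset inCs and re-encodes the text into outFile with outCs.
    static void convert(String inFile, String inCs, String outFile, String outCs)
            throws IOException {
        try (Reader in = new InputStreamReader(new FileInputStream(inFile), inCs);
             Writer out = new OutputStreamWriter(new FileOutputStream(outFile), outCs)) {
            int c;
            // Copy one UTF-16 code unit at a time; the writer does the encoding.
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        }
    }

    public static void main(String[] a) throws IOException {
        if (a.length != 4) {
            System.out.println("Usage: java ConvertSketch inFile inCharset outFile outCharset");
            return;
        }
        convert(a[0], a[1], a[2], a[3]);
    }
}
```

The reader turns bytes into Unicode characters and the writer turns them back into bytes, so any character the output encoding cannot represent is replaced by the writer at that point.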
By looking at the output files, you should notice the following:
hello.ascii - In this file, only the English message is good, because it contains only
ASCII characters. Both the Simplified Chinese and Traditional Chinese messages
are damaged. Each character in these messages is replaced by 0x3F ("?"), the byte
the encoder writes for a character it cannot map to the target encoding.
hello.iso-8859-1 - This is identical to hello.ascii, because the messages contain no
characters in the 0x80 - 0xFF range.
hello.utf-8 - This file contains all messages with no damage. The ASCII
characters are stored as one byte each, as expected.
hello.gbk - In this file, the Simplified Chinese message is good. In fact,
characters in the Simplified Chinese message are stored with their code values from
the GBK character set standard. The English message is also good, because GBK is
backward compatible with ASCII. We are lucky with the Traditional Chinese message,
because the Big5 characters used in the message are also valid in the GBK standard.
If you use characters that exist only in Big5, the result could be different.
hello.big5 - In this file, the Traditional Chinese message is good. In fact,
characters in the Traditional Chinese message are stored with their code values from
the Big5 character set standard. The English message is also good, because Big5 is
backward compatible with ASCII. We are not lucky with the Simplified Chinese message:
two GB characters used in the message are not valid in the Big5 standard, so 0x3F was
stored for those characters.
hello.shift_jis - In this file, the English message is still good. Some of the
characters in both the Simplified and Traditional Chinese messages are invalid and
were replaced by 0x3F placeholders. Other Chinese characters are still valid
in the Shift_JIS character set. This is not so surprising, because Chinese and
Japanese share many characters.
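The observations above can also be checked in code, without opening the output files. The sketch below is only an illustration, assuming Java 7 or later: the class name UnmappableCheck and the sample string are made up (the sample is a stand-in, not the contents of the hello message file). It asks each encoding which characters it cannot represent, and counts the 0x3F bytes that String.getBytes() writes in their place. GBK, Big5 and Shift_JIS are optional charsets, so the program skips any that the local JRE does not provide.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only.
class UnmappableCheck {
    // Counts the Java chars in text that the named charset cannot encode.
    static int countUnmappable(String text, String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (!encoder.canEncode(text.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    // Counts the 0x3F bytes in the encoded form of text.
    // Only meaningful when text itself contains no '?' character.
    static int countQuestionMarkBytes(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        int count = 0;
        for (byte b : bytes) {
            if (b == 0x3F) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample text: "Hello " followed by two Chinese characters.
        String sample = "Hello \u4F60\u597D";
        String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8", "GBK", "Big5", "Shift_JIS"};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not supported by this JRE, skipped");
                continue;
            }
            System.out.println(name + ": "
                + countUnmappable(sample, name) + " unmappable character(s), "
                + countQuestionMarkBytes(sample, name) + " byte(s) of 0x3F");
        }
    }
}
```

Checking with canEncode() before converting is the safer of the two approaches, because once the 0x3F byte has been written there is no way to tell which character it replaced.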
Viewing Unicode Text
Now, we have these greeting messages saved in many different encodings. The next
question is how to display them on the screen as glyphs of the corresponding languages.
One way I have used in the past is to view the text files in a multi-language enabled
Web browser like IE. To do this, we have to mark up the text as an HTML file,
using a program like this one:
/**
 * EncodingHtml.java
 * Copyright (c) 2002 by Dr. Herong Yang
 *
 * This program allows you to mark up a text file into an HTML file.
 */
import java.io.*;

class EncodingHtml {
    public static void main(String[] a) {
        if (a.length < 2) {
            System.out.println("Usage: java EncodingHtml inFile charsetName");
            return;
        }
        String inFile = a[0];
        String inCharsetName = a[1];
        String outFile = inFile + ".html";
        // The output uses the same charset as the input, so the text is not
        // converted; only the HTML head and tail are added around it.
        // try-with-resources closes both streams even if an error occurs.
        try (InputStreamReader in = new InputStreamReader(
                 new FileInputStream(inFile), inCharsetName);
             OutputStreamWriter out = new OutputStreamWriter(
                 new FileOutputStream(outFile), inCharsetName)) {
            writeHead(out, inCharsetName);
            int n = 0; // number of Java chars (UTF-16 code units) copied
            int c = in.read();
            while (c != -1) {
                out.write(c);
                n++;
                c = in.read();
            }
            writeTail(out);
            System.out.println("Number of characters: " + n);
        } catch (IOException e) {
            System.out.println(e.toString());
        }
    }

    // Writes the HTML head, declaring the charset so the browser can pick it up.
    public static void writeHead(OutputStreamWriter out, String cs)
            throws IOException {
        out.write("<html><head>\n");
        out.write("<meta http-equiv=\"Content-Type\""
            + " content=\"text/html; charset=" + cs + "\">\n");
        out.write("</head><body><pre>");
    }

    // Closes the tags opened by writeHead().
    public static void writeTail(OutputStreamWriter out)
            throws IOException {
        out.write("</pre></body></html>\n");
    }
}
Now, let's compile this program and run it with hello.utf-8:
javac EncodingHtml.java
java EncodingHtml hello.utf-8 utf-8
If you have installed IE with Chinese language support, you should
be able to open the output file, hello.utf-8.html, and enjoy reading the
messages in English, Simplified Chinese, and Traditional Chinese.
Then run EncodingHtml with the other encodings:
java EncodingHtml hello.gbk gbk
java EncodingHtml hello.big5 big5
java EncodingHtml hello.shift_jis shift_jis
View the output files with IE, and compare the results:
- hello.utf-8.html - IE automatically sets View/Encoding to utf-8. All messages are perfect.
- hello.gbk.html - IE automatically sets View/Encoding to gb2312. All messages are perfect.
- hello.big5.html - IE automatically sets View/Encoding to big5. The Simplified Chinese message has two bad characters.
- hello.shift_jis.html - IE automatically sets View/Encoding to shift_jis. Both the Simplified and Traditional Chinese messages have bad characters.
If you manually change the View/Encoding setting to a different encoding, IE will no
longer be able to show the messages with the right glyphs.
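The same effect can be reproduced outside the browser. The sketch below is only an illustration, assuming Java 7 or later; the class name WrongCharsetDemo, the method reinterpret() and the sample string are made up. It encodes a string as UTF-8 and then decodes the bytes twice: once with the right charset, and once with ISO-8859-1, which is roughly what happens when the View/Encoding setting does not match the file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class WrongCharsetDemo {
    // Encodes text with the charset it was written in, then decodes the
    // bytes with the charset the viewer assumes.
    static String reinterpret(String text, Charset written, Charset assumed) {
        return new String(text.getBytes(written), assumed);
    }

    public static void main(String[] args) {
        // Assumed sample text: "Hello " followed by two Chinese characters.
        String original = "Hello \u4F60\u597D";

        String right = reinterpret(original,
            StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String wrong = reinterpret(original,
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println("Decoded as UTF-8 matches original: "
            + right.equals(original));
        System.out.println("Decoded as ISO-8859-1 matches original: "
            + wrong.equals(original));
        // Each Chinese character occupies 3 bytes in UTF-8, and ISO-8859-1
        // turns every byte into one character, so the string grows.
        System.out.println("Lengths: " + original.length()
            + " vs " + wrong.length());
    }
}
```

The bytes in the file never change; only the interpretation does. The ASCII part survives because UTF-8 and ISO-8859-1 agree on those byte values.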
(Continued on next part...)