JDK - Encoding Conversion
(Continued from previous part...)
Unicode Signs in Different Encodings
I wanted to try the utility programs mentioned in these notes one more time, this
time with some Unicode signs. So I copied UnicodeHello.java and modified it into
UnicodeSign.java:
/**
* UnicodeSign.java
* Copyright (c) 2002 by Dr. Herong Yang
*
* This program is a simple tool that lets you enter several lines of
* text and write them into a file with the specified encoding
* (charset name). The input text lines use the Java string convention,
* which allows you to enter ASCII characters directly, and any
* non-ASCII characters with escape sequences.
*
* This version of the program is to write out some interesting signs.
*/
import java.io.*;
class UnicodeSign {
   public static void main(String[] a) {
      // The following array contains the text to be saved into the output
      // file. To enter your own text, just replace this array.
      String[] text = {
         "U+005C(\\)REVERSE SOLIDUS", //\u005C is '\', cannot be entered directly
         "U+007E(\u007E)TILDE",
         "U+00A2(\u00A2)CENT SIGN",
         "U+00A3(\u00A3)POUND SIGN",
         "U+00A5(\u00A5)YEN SIGN",
         "U+00A6(\u00A6)BROKEN BAR",
         "U+00A7(\u00A7)SECTION SIGN",
         "U+00A9(\u00A9)COPYRIGHT SIGN",
         "U+00AC(\u00AC)NOT SIGN",
         "U+00AE(\u00AE)REGISTERED SIGN",
         "U+2022(\u2022)BULLET",
         "U+2023(\u2023)TRIANGULAR BULLET",
         "U+203B(\u203B)REFERENCE MARK",
         "U+2043(\u2043)HYPHEN BULLET",
         "U+FF04(\uFF04)FULLWIDTH DOLLAR SIGN",
         "U+FF05(\uFF05)FULLWIDTH PERCENT SIGN",
         "U+FF08(\uFF08)FULLWIDTH LEFT PARENTHESIS",
         "U+FF09(\uFF09)FULLWIDTH RIGHT PARENTHESIS",
         "U+FF10(\uFF10)FULLWIDTH DIGIT ZERO",
         "U+FF11(\uFF11)FULLWIDTH DIGIT ONE",
         "U+FF21(\uFF21)FULLWIDTH LATIN CAPITAL LETTER A",
         "U+FF22(\uFF22)FULLWIDTH LATIN CAPITAL LETTER B",
         "U+FF41(\uFF41)FULLWIDTH LATIN SMALL LETTER A",
         "U+FF42(\uFF42)FULLWIDTH LATIN SMALL LETTER B",
         "U+FFE0(\uFFE0)FULLWIDTH CENT SIGN",
         "U+FFE1(\uFFE1)FULLWIDTH POUND SIGN",
         "U+FFE5(\uFFE5)FULLWIDTH YEN SIGN"
      };
      // Output file name and charset name can be overridden on the command line
      String outFile = "sign.utf-16be";
      if (a.length>0) outFile = a[0];
      String outCharsetName = "utf-16be";
      if (a.length>1) outCharsetName = a[1];
      String crlf = System.getProperty("line.separator");
      try {
         OutputStreamWriter out = new OutputStreamWriter(
            new FileOutputStream(outFile), outCharsetName);
         for (int i=0; i<text.length; i++) {
            out.write(text[i]);
            out.write(crlf);
         }
         out.close();
      } catch (IOException e) {
         System.out.println(e.toString());
      }
   }
}
Then I ran this program and converted the output file into different encodings:
javac UnicodeSign.java
java UnicodeSign sign.utf-16be utf-16be
java EncodingConverter sign.utf-16be utf-16be sign.utf-8 utf-8
java EncodingHtml sign.utf-8 utf-8
java EncodingConverter sign.utf-16be utf-16be sign.gbk gbk
java EncodingHtml sign.gbk gbk
java EncodingConverter sign.utf-16be utf-16be sign.shift_jis shift_jis
java EncodingHtml sign.shift_jis shift_jis
java EncodingConverter sign.utf-16be utf-16be sign.johab johab
java EncodingHtml sign.johab johab
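The conversion step in the commands above can be pictured with the following minimal sketch. It is not the EncodingConverter program from the earlier part of these notes; the class name ReencodeSketch and its reencode method are made up for illustration. It assumes the conversion amounts to decoding the input bytes with one charset and encoding the resulting characters with another.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

// Hypothetical illustration only, not the author's EncodingConverter.
class ReencodeSketch {
   // Reads inFile as inCharset and writes the same characters to
   // outFile as outCharset.
   static void reencode(String inFile, String inCharset,
         String outFile, String outCharset) throws IOException {
      try (Reader in = new InputStreamReader(
               new FileInputStream(inFile), Charset.forName(inCharset));
            Writer out = new OutputStreamWriter(
               new FileOutputStream(outFile), Charset.forName(outCharset))) {
         char[] buf = new char[4096];
         int n;
         // Copy decoded characters; the writer encodes them on the way out.
         while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
         }
      }
   }

   public static void main(String[] a) throws IOException {
      if (a.length != 4) {
         System.out.println(
            "Usage: java ReencodeSketch inFile inCharset outFile outCharset");
         return;
      }
      reencode(a[0], a[1], a[2], a[3]);
   }
}
```

Note that an OutputStreamWriter built this way does not fail on a character the output charset cannot represent. It silently writes the charset's replacement sequence instead, which is worth keeping in mind when looking at the converted files.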
Then I viewed the differently encoded files with IE
and noticed the following:
- sign.utf-8.html - The signs looked very good except two: TRIANGULAR BULLET and
HYPHEN BULLET.
- sign.gbk.html - Many signs at low code points were wrong, such as CENT SIGN.
- sign.shift_jis.html - Some signs were wrong, such as FULLWIDTH CENT SIGN, but
CENT SIGN was correct.
- sign.johab.html - As with the gbk encoding, many signs at low code points were
wrong, such as CENT SIGN.
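One possible cause of wrong signs is that the output charset simply has no mapping for them. The sketch below is a way to check this from Java rather than by eye. The class name EncodableCheck and the method unmappable are hypothetical, and the charset names are the ones used in the commands above; whether gbk, shift_jis and johab are available depends on the JDK installation, so the program skips any it cannot find. It only looks at single char values, which is enough here because every sign in the list is in the Basic Multilingual Plane.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original utility programs.
class EncodableCheck {
   // Returns the code points in text that the named charset cannot encode.
   static List<String> unmappable(String text, String charsetName) {
      CharsetEncoder enc = Charset.forName(charsetName).newEncoder();
      List<String> bad = new ArrayList<>();
      for (int i = 0; i < text.length(); i++) {
         char c = text.charAt(i);
         if (!enc.canEncode(c)) {
            bad.add(String.format("U+%04X", (int) c));
         }
      }
      return bad;
   }

   public static void main(String[] a) {
      // A subset of the signs written by UnicodeSign.java
      String signs = "\u00A2\u00A3\u00A5\u00A6\u2022\u2023\u203B\u2043"
         + "\uFF04\uFFE0\uFFE1\uFFE5";
      String[] names = a.length > 0 ? a
         : new String[] {"utf-8", "gbk", "shift_jis", "johab"};
      for (String name : names) {
         if (!Charset.isSupported(name)) {
            System.out.println(name + ": charset not available, skipped");
            continue;
         }
         System.out.println(name + ": " + unmappable(signs, name));
      }
   }
}
```

A sign reported here cannot survive the conversion. A sign that is not reported but still looks wrong in the browser points to a different cause, such as a missing font.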
Conclusion:
- A Java program seems to be a good way to store Unicode text in a file if you don't
have a good word processor that handles Unicode text.
- When converting a text file from one encoding to another, you need to make sure
that every character in the text file is a valid character in the character set of
the output encoding.
- IE is a very good tool for viewing Unicode text, provided you have installed all
the required fonts.
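The second point can also be enforced in code. The following is a minimal sketch, with a made-up class name StrictEncode, of encoding text so that an unmappable character raises an exception instead of being replaced silently. It assumes the text is already in memory as a String; it is one way to do the check, not the method used by the utility programs in these notes.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper, shown for illustration only.
class StrictEncode {
   // Encodes text with the named charset; throws CharacterCodingException
   // if any character is malformed or has no mapping in that charset.
   static byte[] encodeStrict(String text, String charsetName)
         throws CharacterCodingException {
      ByteBuffer bb = Charset.forName(charsetName).newEncoder()
         .onMalformedInput(CodingErrorAction.REPORT)
         .onUnmappableCharacter(CodingErrorAction.REPORT)
         .encode(CharBuffer.wrap(text));
      byte[] bytes = new byte[bb.remaining()];
      bb.get(bytes);
      return bytes;
   }

   public static void main(String[] a) {
      String text = "U+00A2(\u00A2)CENT SIGN";
      for (String name : new String[] {"utf-8", "us-ascii"}) {
         try {
            byte[] bytes = encodeStrict(text, name);
            System.out.println(name + ": " + bytes.length + " bytes");
         } catch (CharacterCodingException e) {
            System.out.println(name + ": cannot encode (" + e + ")");
         }
      }
   }
}
```

Failing loudly like this trades convenience for safety: the conversion stops at the first bad character rather than producing a file with question marks in it.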
Exercise: Add more messages in other languages to UnicodeHello.java.
Source: Herong's Notes on JDK.