JDK - Encoding Maps
This chapter helps you understand:
- Encoding Map Counter
- Comparison of Encoding Maps
Notes and sample code below are based on J2SDK 1.4.1_01.
Encoding Map Analyzer
As mentioned in my other note, "Character Set and Encoding", J2SDK 1.4.1_01
for Windows 2000 provides 48 built-in encodings.
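To see which encodings a particular Java installation supports, the charset registry can be queried directly. The sketch below is only an illustration: the class name CharsetLister is made up here, it assumes a JDK newer than 1.4.1 (it uses generics and the for-each loop), and the names and total it prints will vary by JDK version and platform, so they need not match the count quoted above.

```java
import java.nio.charset.Charset;
import java.util.Set;

// Hypothetical helper, not part of the original notes.
class CharsetLister {
    // Returns the canonical names of all charsets known to the running JVM.
    static Set<String> names() {
        return Charset.availableCharsets().keySet();
    }

    public static void main(String[] args) {
        for (String name : names()) {
            System.out.println(name);
        }
        // The total depends on the JDK version and platform.
        System.out.println("Total: " + names().size());
    }
}
```

Any name printed by this sketch can be passed to EncodingAnalyzer as its command line argument.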
The following program analyzes a given encoding and prints
a map between the code points (from 0x0000 to 0xFFFF) and the encoded
byte sequences:
/**
 * EncodingAnalyzer.java
 * Copyright (c) 2002 by Dr. Herong Yang
 */
import java.io.*;
class EncodingAnalyzer {
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] a) {
      String charset = null;
      if (a.length>0) charset = a[0];
      if (charset==null) System.out.println("Default encoding:");
      else System.out.println(charset+" encoding:");
      int lastByte = 0;
      int lastLength = 0;
      byte[] startSequence = null;
      char startChar = 0;
      byte[] endSequence = null;
      char endChar = 0;
      boolean isFirstChar = true;
      // Encode every 16-bit code point and group consecutive ones into ranges
      for (int i=0; i<0x010000; i++) {
         char c = (char) i;
         String s = String.valueOf(c);
         byte[] b = null;
         if (charset==null) {
            b = s.getBytes();
         } else {
            try {
               b = s.getBytes(charset);
            } catch (UnsupportedEncodingException e) {
               System.out.println(e.toString());
               break;
            }
         }
         int l = b.length;
         int lb = ((int) b[l-1]) & 0x00FF; // last byte as an unsigned value
         if (isFirstChar==true) {
            isFirstChar = false;
            startSequence = b;
            startChar = c;
            lastByte = lb - 1;
            lastLength = l;
         }
         // A range continues while the length is unchanged and the last
         // byte is the same or one higher; otherwise print it and start anew
         if (!(l==lastLength && (lb==lastByte+1 || lb==lastByte))) {
            System.out.print(charToHex(startChar)+" >");
            printBytes(startSequence);
            System.out.print(" - "+charToHex(endChar)+" >");
            printBytes(endSequence);
            System.out.println("");
            startSequence = b;
            startChar = c;
         }
         endSequence = b;
         endChar = c;
         lastLength = l;
         lastByte = lb;
      }
      // Print the final range
      System.out.print(charToHex(startChar)+" >");
      printBytes(startSequence);
      System.out.print(" - "+charToHex(endChar)+" >");
      printBytes(endSequence);
      System.out.println("");
   }
   public static void printBytes(byte[] b) {
      for (int j=0; j<b.length; j++)
         System.out.print(" "+byteToHex(b[j]));
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}
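Note that:
- String.getBytes() is used to encode each code point, which is stored as a
"char" value.
- Since a Java "char" can only hold values in the 0x0000 - 0xFFFF range, only
a subset of the character set will be encoded for some encodings, like UTF-8,
which can encode code points up to 0x10FFFF.
- The encoding name should be specified as a command line argument. If it is
omitted, the default encoding is used.
- Consecutive code points are printed as one range as long as their encoded
byte sequences have the same length and the last byte stays the same or
increases by one.
The output of this program will be discussed in the sections below.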
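The limit of the 0x0000 - 0xFFFF loop can be illustrated with a small sketch. It assumes a newer JDK than the one these notes are based on: Character.toChars() needs Java 5 or later and StandardCharsets needs Java 7 or later. The class name SupplementaryDemo and the method utf8Hex() are made up for this illustration. A code point above 0xFFFF is represented as two "char" values, which is why a loop over single "char" values never reaches it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original notes.
class SupplementaryDemo {
    // Encodes one code point (0x0000 - 0x10FFFF) in UTF-8 and returns
    // the resulting bytes as an upper-case hex string.
    static String utf8Hex(int codePoint) {
        // toChars() returns one char, or two chars for code points above 0xFFFF
        String s = new String(Character.toChars(codePoint));
        byte[] b = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte x : b) {
            sb.append(String.format("%02X", x & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0xFFFF));   // last code point the analyzer loop reaches
        System.out.println(utf8Hex(0x10000));  // first code point beyond the loop
        System.out.println(utf8Hex(0x10FFFF)); // highest code point UTF-8 can encode
    }
}
```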
US-ASCII
US-ASCII encoding:
Code        Code
Point       Point
0000 > 00 - 007F > 7F
0080 > 3F - FFFF > 3F
- This is a very simple map.
- For valid code points, the encoded byte sequence is one byte only, equal to
the low-order byte of the code point.
- Only code points in the 0x0000 - 0x007F range are valid. All other code
points are encoded as 0x3F, the "?" character.
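The two output lines above can be spot-checked with a few individual characters. This is a minimal sketch, assuming Java 7 or later for StandardCharsets; the class name AsciiMapCheck and the method encode() are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original notes.
class AsciiMapCheck {
    // Encodes a single char in US-ASCII and returns the last encoded byte
    // as an unsigned value (0 - 255).
    static int encode(char c) {
        byte[] b = String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
        return b[b.length - 1] & 0xFF;
    }

    public static void main(String[] args) {
        char[] samples = { 'A', (char) 0x007F, (char) 0x0080, (char) 0x00E9 };
        for (char c : samples) {
            // Prints the code point and its encoded byte, both in hex
            System.out.println(String.format("%04X > %02X", (int) c, encode(c)));
        }
    }
}
```

Code points up to 0x007F should come back unchanged, while anything above that range should come back as 0x3F.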
(Continued on next part...)