JDK - Encoding Map Counts
(Continued from previous part...)
    // Encodes a single char with the named charset.
    // Returns null if the char cannot be encoded.
    public static byte[] encodeByEncoder(char c, String cs) {
        Charset cso = null;
        byte[] b = null;
        try {
            cso = Charset.forName(cs);
            CharsetEncoder e = cso.newEncoder();
            e.reset();
            ByteBuffer bb = e.encode(CharBuffer.wrap(new char[] {c}));
            // only the first bb.limit() bytes of the backing array are valid
            if (bb.limit()>0) b = copyBytes(bb.array(),bb.limit());
        } catch (IllegalArgumentException e) {
            // covers IllegalCharsetNameException and UnsupportedCharsetException
            System.out.println(e.toString());
        } catch (CharacterCodingException e) {
            // invalid character, return null
        }
        return b;
    }

    // Prints the bytes in hex, or " XX" if there is no encoded output
    public static void printBytes(byte[] b) {
        if (b!=null) {
            for (int j=0; j<b.length; j++)
                System.out.print(" "+byteToHex(b[j]));
        } else {
            System.out.print(" XX");
        }
    }

    // Copies the first l bytes of a into a new array
    public static byte[] copyBytes(byte[] a, int l) {
        byte[] b = new byte[l];
        for (int i=0; i<Math.min(l,a.length); i++) b[i] = a[i];
        return b;
    }

    // Converts a byte to 2 hex digits
    public static String byteToHex(byte b) {
        char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
        return new String(a);
    }

    // Converts a char to 4 hex digits, high byte first
    public static String charToHex(char c) {
        byte hi = (byte) (c >>> 8);
        byte lo = (byte) (c & 0xff);
        return byteToHex(hi) + byteToHex(lo);
    }
}
Note that:
- CharsetEncoder.encode() is used to encode the code points stored as "char" type.
- Since a Java "char" holds only a 16-bit value in the 0x0000 - 0xFFFF range,
this program covers only a subset of the character set for some encodings,
like UTF-8, which can encode code points up to 0x10FFFF.
- The encoding name should be specified as a command line argument.
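The 16-bit limit of the "char" type can be seen directly with a small sketch. It is not part of the original program: the class name SupplementaryDemo and its helper methods are made up for illustration, and it encodes through String.getBytes() instead of CharsetEncoder. It takes the code point 0x1F600, which lies above 0xFFFF, shows that Java needs 2 "char" values to hold it, and prints its 4-byte UTF-8 output.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original program
class SupplementaryDemo {
    // Encodes one code point to UTF-8 by going through a String
    static byte[] encodeCodePoint(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as upper case hex digits separated by spaces
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // a code point above 0xFFFF
        // Number of char values needed to hold this code point: prints 2
        System.out.println(Character.charCount(cp));
        // UTF-8 output of the code point: prints F0 9F 98 80
        System.out.println(toHex(encodeCodePoint(cp)));
    }
}
```

A loop over single "char" values, like the one used by the program in this tutorial, never reaches such code points.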
If you run this program with US-ASCII as the argument, you will get:
US-ASCII encoding:
0000 > 00 - 007F > 7F = 128
0080 > XX - FFFF > XX = 65408
Total characters = 65536
Valid characters = 128
Invalid characters = 65408
This tells us that the US-ASCII character set has only 128 characters.
If you run this program with ISO-8859-1 (Latin 1) as the argument, you will get:
ISO-8859-1 encoding:
0000 > 00 - 00FF > FF = 256
0100 > XX - FFFF > XX = 65280
Total characters = 65536
Valid characters = 256
Invalid characters = 65280
This tells us that the ISO-8859-1 character set has only 256 characters.
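The two valid character counts above can be cross-checked with a shorter sketch. It is an alternative to the program in this tutorial, not a copy of it: the class name CanEncodeCount and the method countEncodable() are made up for illustration, and it asks CharsetEncoder.canEncode() about each "char" value instead of calling encode() and catching the exception.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, not part of the original program
class CanEncodeCount {
    // Counts the char values in 0x0000 - 0xFFFF that the charset can encode
    static int countEncodable(String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int valid = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            if (encoder.canEncode((char) i)) valid++;
        }
        return valid;
    }

    public static void main(String[] args) {
        // prints US-ASCII = 128
        System.out.println("US-ASCII = " + countEncodable("US-ASCII"));
        // prints ISO-8859-1 = 256
        System.out.println("ISO-8859-1 = " + countEncodable("ISO-8859-1"));
    }
}
```

This version gives only the totals. The program in this tutorial also reports the ranges of valid and invalid code points together with their encoded bytes.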
Comparison of Encoding Maps
The following table is based on the output of the EncodingCounter program with
different supported encoding names. It provides a brief comparison of
some different encodings.
Encoding     Map     US-ASCII
Name         Size    Compatible   Notes
US-ASCII       128   Y            7-bit characters only
ISO-8859-1     256   Y            8-bit (single byte) characters
CP1252         251   Y            One byte output, with code points up to 0x2122
UTF-8        63488   Y            1-3 bytes
UTF-16BE     63488   N            2 bytes, carbon copying the code points
UTF-16LE     63488   N            2 bytes, reversing the code points
UTF-16       63488   N            4 bytes, last 2 bytes = UTF-16BE
GBK          24068   Y            1-2 bytes, Chinese 1993 standard
GB18030      63488   Y            1-4 bytes, superset of GBK, 2000 standard
Source: Herong's Notes on JDK.
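The map size of 63488 in the table equals 65536 minus 2048, and 2048 is the number of surrogate values in the 0xD800 - 0xDFFF range, which cannot be encoded as single characters. The sketch below checks that arithmetic and prints the bytes of the letter "A" in the three UTF-16 variants, to illustrate the notes on byte order. It is not part of the original program: the class name Utf16Demo and its helper methods are made up for illustration, and it uses String.getBytes() instead of CharsetEncoder.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original program
class Utf16Demo {
    // Counts the char values that are surrogates (0xD800 - 0xDFFF)
    static int countSurrogates() {
        int n = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            if (Character.isSurrogate((char) i)) n++;
        }
        return n;
    }

    // Encodes a string and formats the bytes as hex digits separated by spaces
    static String hex(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int surrogates = countSurrogates();
        System.out.println(surrogates);          // prints 2048
        System.out.println(65536 - surrogates);  // prints 63488
        System.out.println(hex("A", StandardCharsets.UTF_16BE)); // prints 00 41
        System.out.println(hex("A", StandardCharsets.UTF_16LE)); // prints 41 00
        // Java's UTF-16 encoder writes a byte order mark (FE FF) first
        System.out.println(hex("A", StandardCharsets.UTF_16));   // prints FE FF 00 41
    }
}
```

The byte order mark explains why UTF-16 produces 4 bytes for a single character while UTF-16BE and UTF-16LE produce 2.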