Character Set Encoding Comparison

JDK Tutorials - Herong's Tutorial Examples

This section provides a tutorial example on how to compare some commonly used character set encodings by number of supported characters, byte sequence size, and US-ASCII compatibility.

Here is the output of my sample program, EncodingCounter.java, for US-ASCII encoding:

herong> java EncodingCounter.java US-ASCII

US-ASCII encoding:
0000 > 00 - 007F > 7F = 128
0080 > XX - FFFF > XX = 65408
Total characters = 65536
Valid characters = 128
Invalid characters = 65408

This tells us that the US-ASCII character set has only 128 characters.

If you run EncodingCounter.java again with ISO-8859-1 (Latin 1) as the argument, you will get:

herong> java EncodingCounter.java ISO-8859-1

ISO-8859-1 encoding:
0000 > 00 - 00FF > FF = 256
0100 > XX - FFFF > XX = 65280
Total characters = 65536
Valid characters = 256
Invalid characters = 65280

This tells us that the ISO-8859-1 character set has only 256 characters.
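The counts shown above can be approximated with a short loop over all 65536 16-bit char values. The following is only a minimal sketch, not the EncodingCounter.java program itself: the class name EncodableCountSketch and the method countEncodable() are hypothetical, and using CharsetEncoder.canEncode() as the validity test is an assumption about how "valid" is decided.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch; not the original EncodingCounter.java.
class EncodableCountSketch {

    // Counts the 16-bit char values (0x0000 - 0xFFFF) that the named
    // charset can encode. Assumption: "valid" means canEncode() is true.
    static int countEncodable(String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int valid = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            if (encoder.canEncode((char) i)) {
                valid++;
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "US-ASCII";
        int total = 0x10000;
        int valid = countEncodable(name);
        System.out.println(name + " encoding:");
        System.out.println("Total characters = " + total);
        System.out.println("Valid characters = " + valid);
        System.out.println("Invalid characters = " + (total - valid));
    }
}
```

Note that canEncode() returns false for a lone surrogate char (0xD800 - 0xDFFF), so no encoding can reach more than 65536 - 2048 = 63488 valid characters with this approach.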

The following table is based on the output of the EncodingCounter.java program. It provides a brief comparison of some commonly used encodings:

Encoding     Map     US-ASCII
Name         Size    Compatible   Notes
----------   -----   ----------   ---------------------------------------------
US-ASCII     128     Y            7-bit characters only
ISO-8859-1   256     Y            8-bit (single byte) characters
CP1252       251     Y            1 byte, maps code points as high as 0x2122
UTF-8        63488   Y            1-3 bytes for code points up to 0xFFFF
UTF-16BE     63488   N            2 bytes, identical to the code point value
UTF-16LE     63488   N            2 bytes, code point value with bytes swapped
UTF-16       63488   N            4 bytes, last 2 bytes = UTF-16BE
GBK          24068   Y            1-2 bytes, Chinese 1993 standard
GB18030      63488   Y            1-4 bytes, superset of GBK, 2000 standard
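
The byte sequence notes and the "US-ASCII Compatible" column in the table can be spot-checked with String.getBytes(). The following is a minimal sketch under stated assumptions: the class EncodingBytesSketch and its helpers toHex() and isAsciiCompatible() are hypothetical names, and "compatible" is assumed to mean that each of the 128 US-ASCII characters is encoded as one byte with the same value.

```java
import java.nio.charset.Charset;

// Hypothetical sketch for spot-checking the comparison table.
class EncodingBytesSketch {

    // Returns the encoded bytes of the text as space-separated hex pairs.
    static String toHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Assumption: an encoding is US-ASCII compatible if every char in
    // 0x00 - 0x7F is encoded as exactly one byte of the same value.
    static boolean isAsciiCompatible(String charsetName) {
        Charset charset = Charset.forName(charsetName);
        for (int i = 0; i < 0x80; i++) {
            byte[] bytes = String.valueOf((char) i).getBytes(charset);
            if (bytes.length != 1 || bytes[0] != (byte) i) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] names = {"US-ASCII", "ISO-8859-1", "UTF-8",
                          "UTF-16BE", "UTF-16LE", "UTF-16"};
        for (String name : names) {
            System.out.println(name + ": A = " + toHex("A", name)
                + ", ASCII compatible = " + isAsciiCompatible(name));
        }
        // The Euro sign, U+20AC, needs 3 bytes in UTF-8.
        System.out.println("UTF-8: U+20AC = " + toHex("\u20AC", "UTF-8"));
    }
}
```

For the letter "A" (0x0041), this should show 00 41 for UTF-16BE, 41 00 for UTF-16LE, and FE FF 00 41 for UTF-16, where Java's encoder writes a big-endian byte order mark first. That leading mark is why a single character takes 4 bytes in the UTF-16 row.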