Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
Character Set Encoding Comparison
This section provides a tutorial example on how to compare some commonly used character set encodings by number of characters, byte sequence size and US-ASCII compatibility.
Here is the output of my sample program, EncodingCounter2.java, for US-ASCII encoding:
C:\herong>javac EncodingCounter2.java

C:\herong>java EncodingCounter2 US-ASCII
US-ASCII encoding:
00000000 > 00 - 0000007F > 7F = 128
00000080 > XX - 000FFFFF > XX = 1048448
Total characters = 1048576
Valid characters = 128
Invalid characters = 1048448
This tells us that the US-ASCII character set has only 128 characters.
If you run EncodingCounter2.java again with ISO-8859-1 (Latin 1) as the argument, you will get:
C:\herong>java EncodingCounter2 ISO-8859-1
ISO-8859-1 encoding:
00000000 > 00 - 000000FF > FF = 256
00000100 > XX - 000FFFFF > XX = 1048320
Total characters = 1048576
Valid characters = 256
Invalid characters = 1048320
This tells us that the ISO-8859-1 character set has only 256 characters.
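As a rough illustration of how such a count could be obtained, here is a minimal sketch. It is not the EncodingCounter2.java source: the class name ValidCharCountSketch, the method countValid and the use of CharsetEncoder.canEncode() are assumptions made for this example. It tests the same range of code points shown in the output above, 00000000 to 000FFFFF.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical sketch; not the EncodingCounter2.java source.
class ValidCharCountSketch {
    // Highest code point tested, matching the 000FFFFF upper bound in the output above.
    static final int LAST_CODE_POINT = 0x0FFFFF;

    // Counts the code points in 0..LAST_CODE_POINT that the named charset can encode.
    static int countValid(String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int valid = 0;
        for (int cp = 0; cp <= LAST_CODE_POINT; cp++) {
            // Character.toChars yields one char, or a surrogate pair above 0xFFFF.
            String s = new String(Character.toChars(cp));
            if (encoder.canEncode(s)) {
                valid++;
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "US-ASCII";
        int total = LAST_CODE_POINT + 1;
        int valid = countValid(name);
        System.out.println(name + " encoding:");
        System.out.println("Total characters = " + total);
        System.out.println("Valid characters = " + valid);
        System.out.println("Invalid characters = " + (total - valid));
    }
}
```

Running this sketch with US-ASCII or ISO-8859-1 as the argument should report 128 and 256 valid characters respectively. For other encodings the count may depend on the JDK version and on how the encoder treats unmappable characters, so it may not match a program written differently.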
The following table is based on the output of the EncodingCounter2.java program. It provides a brief comparison of some commonly used encodings:
Encoding    Map      US-ASCII
Name        Size     Compatible  Notes
US-ASCII    128      Y           7-bit characters only
ISO-8859-1  256      Y           8-bit (single byte) characters
CP1252      251      Y           One byte output, with code points up to 0x2122
UTF-8       1046528  Y           1-4 bytes, complex algorithm
UTF-16BE    1046528  N           2-4 bytes, code point and surrogate pairs
UTF-16LE    1046528  N           2-4 bytes, reversing byte pair of UTF-16BE
UTF-16      1046528  N           4-6 bytes, same as UTF-16BE with leading BOM
UTF-32BE    1046528  N           4 bytes, code point
UTF-32LE    1046528  N           4 bytes, reversing byte sequence of UTF-32BE
UTF-32      1046528  N           4 bytes, same as UTF-32BE
GB2312      7573     Y           1-2 bytes, Chinese 1980 standard
GBK         24068    Y           1-2 bytes, Chinese 1993 standard
GB18030     1046528  Y           1-4 bytes, superset of GBK, 2000 standard
BIG5        13831    Y           1-2 bytes, traditional Chinese character set
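The "US-ASCII Compatible" and byte size columns can be spot-checked with a few lines of Java. The sketch below is only an illustration under stated assumptions: the class name AsciiCompatSketch and the methods isAsciiCompatible and encodedLength are hypothetical, and "compatible" is taken to mean that each code point from 0x00 to 0x7F encodes to a single byte with the same value.

```java
import java.nio.charset.Charset;

// Hypothetical sketch for spot-checking the comparison table.
class AsciiCompatSketch {
    // True if every code point 0x00..0x7F encodes to the single byte of the same value.
    static boolean isAsciiCompatible(String charsetName) {
        Charset cs = Charset.forName(charsetName);
        for (int cp = 0; cp <= 0x7F; cp++) {
            byte[] bytes = String.valueOf((char) cp).getBytes(cs);
            if (bytes.length != 1 || bytes[0] != (byte) cp) {
                return false;
            }
        }
        return true;
    }

    // Number of bytes produced when one code point is encoded on its own.
    // Note: getBytes substitutes a replacement byte sequence for characters
    // the charset cannot encode, so this is only meaningful for valid characters.
    static int encodedLength(String charsetName, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String[] names = {"UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"};
        for (String name : names) {
            System.out.println(name
                + ": ASCII compatible = " + isAsciiCompatible(name)
                + ", bytes for U+0041 = " + encodedLength(name, 0x41)
                + ", bytes for U+10000 = " + encodedLength(name, 0x10000));
        }
    }
}
```

Because each character is encoded on its own, Java's UTF-16 encoder writes a 2-byte BOM in front of every result, which is where the 4-6 byte range in the table comes from.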