Character Set Encoding Maps - Unicode UTF-16, UTF-16LE, UTF-16BE

JDK Tutorials - Herong's Tutorial Examples

This section provides a tutorial example of analyzing and printing the character set encoding maps of three encodings of the Unicode character set: UTF-16, UTF-16LE, and UTF-16BE.

Here is the output of my sample program, EncodingAnalyzer.java, for UTF-16 encoding:

herong> java EncodingAnalyzer.java UTF-16

Code Point > Byte Sequence - Code Point > Byte Sequence

0000 > FE FF 00 00 - 00FF > FE FF 00 FF
0100 > FE FF 01 00 - 01FF > FE FF 01 FF
0200 > FE FF 02 00 - 02FF > FE FF 02 FF
......
D700 > FE FF D7 00 - D7FF > FE FF D7 FF
D800 > FE FF FF FD - DFFF > FE FF FF FD
E000 > FE FF E0 00 - E0FF > FE FF E0 FF
E100 > FE FF E1 00 - E1FF > FE FF E1 FF
E200 > FE FF E2 00 - E2FF > FE FF E2 FF
......
FF00 > FE FF FF 00 - FFFF > FE FF FF FF

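The encoding map of UTF-16, another encoding used for the Unicode character set, is much simpler than that of UTF-8:

For the code points listed (0x0000 - 0xFFFF), the output sequence has a fixed length of 2 bytes, not counting the leading 0xFEFF. That prefix is the byte order mark (BOM), which flags the bytes that follow as big-endian. Code points above 0xFFFF are outside this map and take 4 bytes.
It is not backward compatible with US-ASCII.
One section of code points is not valid on its own: 0xD800 - 0xDFFF. These are surrogate code points, and each is replaced with 0xFFFD in the output.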
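The byte order mark and the surrogate replacement can be reproduced with a few lines of Java. The following is only a minimal sketch, not the EncodingAnalyzer.java program used above; the class name Utf16BomDemo and its helper methods are hypothetical, and it assumes that String.getBytes() substitutes the encoder's replacement bytes for a lone surrogate, which is consistent with the output listed above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original EncodingAnalyzer.java.
class Utf16BomDemo {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Encodes a single code point in the range 0x0000 - 0xFFFF with UTF-16.
    static String encode(int codePoint) {
        String s = String.valueOf((char) codePoint);
        return hex(s.getBytes(StandardCharsets.UTF_16));
    }

    public static void main(String[] args) {
        // A normal character: BOM (FE FF) followed by the 2-byte code unit.
        System.out.println("0041 > " + encode(0x0041)); // 0041 > FE FF 00 41
        // A lone surrogate: BOM followed by the replacement character FFFD.
        System.out.println("D800 > " + encode(0xD800)); // D800 > FE FF FF FD
    }
}
```

Because the BOM is written once per encoded string, not once per character, encoding a longer string adds only 2 bytes of overhead in total.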

Here is the output for UTF-16LE encoding, the little-endian variation of UTF-16 encoding:

herong> java EncodingAnalyzer.java UTF-16LE

Code Point > Byte Sequence - Code Point > Byte Sequence

0000 > 00 00 - D7FF > FF D7
D800 > FD FF - DFFF > FD FF
E000 > 00 E0 - FFFF > FF FF

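The encoding map of UTF-16LE is very simple:

For the code points listed (0x0000 - 0xFFFF), the output sequence has a fixed length of 2 bytes, with no byte order mark.
It is not backward compatible with US-ASCII.
One section of code points is not valid on its own: 0xD800 - 0xDFFF. Each of these surrogate code points is replaced with 0xFFFD, written as FD FF.
The rest of the code points are encoded by reversing the two bytes of the code point, so the low-order byte comes first.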
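The byte-reversal rule can be checked mechanically against Java's own encoder. The sketch below is an illustration under stated assumptions: the class Utf16LeDemo and its methods are hypothetical names, and the check covers only the range 0x0000 - 0xFFFF with the surrogate block skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of the original EncodingAnalyzer.java.
class Utf16LeDemo {

    // Bytes predicted by the rule: low-order byte first, then high-order byte.
    static byte[] expectedLe(int codePoint) {
        return new byte[] { (byte) (codePoint & 0xFF), (byte) (codePoint >> 8) };
    }

    // Bytes actually produced by Java's UTF-16LE encoder.
    static byte[] actualLe(int codePoint) {
        return String.valueOf((char) codePoint).getBytes(StandardCharsets.UTF_16LE);
    }

    // Counts code points where the rule and the encoder disagree,
    // skipping the surrogate range 0xD800 - 0xDFFF.
    static int countMismatches() {
        int mismatches = 0;
        for (int cp = 0x0000; cp <= 0xFFFF; cp++) {
            if (cp >= 0xD800 && cp <= 0xDFFF) continue;
            if (!Arrays.equals(expectedLe(cp), actualLe(cp))) mismatches++;
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("Mismatches: " + countMismatches());
        byte[] euro = actualLe(0x20AC); // EURO SIGN
        System.out.printf("20AC > %02X %02X%n", euro[0], euro[1]);
    }
}
```

A mismatch count of zero means the reversal rule describes every non-surrogate code point in the listed range.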

Here is the output for UTF-16BE encoding, the big-endian variation of UTF-16 encoding:

herong> java EncodingAnalyzer.java UTF-16BE

Code Point > Byte Sequence - Code Point > Byte Sequence

0000 > 00 00 - 00FF > 00 FF
0100 > 01 00 - 01FF > 01 FF
0200 > 02 00 - 02FF > 02 FF
......
D700 > D7 00 - D7FF > D7 FF
D800 > FF FD - DFFF > FF FD
E000 > E0 00 - E0FF > E0 FF
E100 > E1 00 - E1FF > E1 FF
E200 > E2 00 - E2FF > E2 FF
......
FF00 > FF 00 - FFFF > FF FF

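The encoding map of UTF-16BE is also simple:

For the code points listed (0x0000 - 0xFFFF), the output sequence has a fixed length of 2 bytes, with no byte order mark.
It is not backward compatible with US-ASCII.
One section of code points is not valid on its own: 0xD800 - 0xDFFF. Each of these surrogate code points is replaced with 0xFFFD, written as FF FD.
The rest of the code points are encoded by copying the two bytes of the code point unchanged, high-order byte first.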
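For readers who want to reproduce individual rows of the maps above, here is a rough sketch of how one row could be generated. It is not the author's EncodingAnalyzer.java, which is not shown in this section; the class EncodingRowSketch and the method row() are hypothetical, and only the first and last code point of each range are encoded.

```java
import java.nio.charset.Charset;

// Hypothetical sketch; the real EncodingAnalyzer.java may work differently.
class EncodingRowSketch {

    // Formats a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Encodes one code point in the range 0x0000 - 0xFFFF with the given charset.
    static String encode(int codePoint, Charset cs) {
        return hex(String.valueOf((char) codePoint).getBytes(cs));
    }

    // Builds one row in the "Code Point > Byte Sequence" format used above.
    static String row(int first, int last, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        return String.format("%04X > %s - %04X > %s",
                first, encode(first, cs), last, encode(last, cs));
    }

    public static void main(String[] args) {
        // Charset name from the command line, defaulting to UTF-16BE.
        String name = args.length > 0 ? args[0] : "UTF-16BE";
        System.out.println(row(0x0100, 0x01FF, name));
        System.out.println(row(0xD800, 0xDFFF, name));
        System.out.println(row(0xFF00, 0xFFFF, name));
    }
}
```

Running it with UTF-16, UTF-16LE, or UTF-16BE as the argument makes the differences between the three maps easy to compare: the BOM prefix, the byte order, and the form of the replacement bytes.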