JDK (Java Development Kit) Tutorials
Dr. Herong Yang, Version 5.00

Character Set Encoding Maps - Unicode UTF-16, UTF-16LE, UTF-16BE

This section provides a tutorial example of analyzing and printing character set encoding maps for three encodings of the Unicode character set: UTF-16, UTF-16LE, and UTF-16BE.

Here is the output of my sample program, EncodingAnalyzer.java, for UTF-16 encoding:

Code Point > Byte Sequence - Code Point > Byte Sequence

0000 > FE FF 00 00 - 00FF > FE FF 00 FF
0100 > FE FF 01 00 - 01FF > FE FF 01 FF
0200 > FE FF 02 00 - 02FF > FE FF 02 FF
......
D700 > FE FF D7 00 - D7FF > FE FF D7 FF
D800 > FE FF FF FD - DFFF > FE FF FF FD
E000 > FE FF E0 00 - E0FF > FE FF E0 FF
E100 > FE FF E1 00 - E1FF > FE FF E1 FF
E200 > FE FF E2 00 - E2FF > FE FF E2 FF
......
FF00 > FE FF FF 00 - FFFF > FE FF FF FF
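
EncodingAnalyzer.java itself is not reproduced in this section. As a rough sketch only, rows like the ones above could be produced as shown here. The class MapRowSketch and its methods are hypothetical names chosen for illustration, not the author's code, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MapRowSketch {
    // Format a byte array as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode a single code point in the range 0x0000 - 0xFFFF.
    static byte[] encode(int codePoint, Charset cs) {
        return String.valueOf((char) codePoint).getBytes(cs);
    }

    // One row of the map: the first and last code point of a 256-entry block.
    static String row(int start, Charset cs) {
        int end = start + 0xFF;
        return String.format("%04X > %s - %04X > %s",
            start, toHex(encode(start, cs)), end, toHex(encode(end, cs)));
    }

    public static void main(String[] args) {
        System.out.println("Code Point > Byte Sequence - Code Point > Byte Sequence");
        for (int start = 0x0000; start <= 0xFF00; start += 0x100) {
            System.out.println(row(start, StandardCharsets.UTF_16));
        }
    }
}
```

Unlike the listing above, this sketch prints every 256-entry block and does not merge the 0xD800 - 0xDFFF blocks into one row.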

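The encoding map of UTF-16, another encoding used for the Unicode character set, is much simpler than that of UTF-8:

  • Within the range shown, 0x0000 - 0xFFFF, each code point is encoded as a fixed-length sequence of 2 bytes. The leading 0xFEFF is not part of the character; it is the byte order mark (BOM), which flags the output as big-endian.
  • It is not backward compatible with US-ASCII: even code points 0x00 - 0x7F take 2 bytes.
  • One section of code points, 0xD800 - 0xDFFF, is not valid on its own. These surrogate code points are all mapped to 0xFFFD, the Unicode replacement character.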
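
The role of the leading 0xFEFF can be checked with a short test. This is a minimal sketch, assuming Java 7 or later for StandardCharsets; Utf16BomDemo and toHex are hypothetical names. It suggests that the flag is written once per encoded string rather than once per character, which the one-character-at-a-time map above cannot show.

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    // Format a byte array as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // One character: flag bytes followed by the 2-byte code unit.
        System.out.println(toHex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        // Two characters: the flag still appears only once, at the start.
        System.out.println(toHex("AB".getBytes(StandardCharsets.UTF_16)));  // FE FF 00 41 00 42
        // UTF-16BE gives the same code unit bytes without the flag.
        System.out.println(toHex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
    }
}
```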

Here is the output for UTF-16LE encoding, the little-endian variation of UTF-16 encoding:

Code Point > Byte Sequence - Code Point > Byte Sequence

0000 > 00 00 - D7FF > FF D7
D800 > FD FF - DFFF > FD FF
E000 > 00 E0 - FFFF > FF FF

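The encoding map of UTF-16LE is very simple:

  • Within the range shown, 0x0000 - 0xFFFF, the output sequence is a fixed length, 2 bytes, with no leading format flag.
  • It is not backward compatible with US-ASCII.
  • One section of code points, 0xD800 - 0xDFFF, is not valid on its own. These code points are all mapped to 0xFFFD, which appears as FD FF in little-endian order.
  • The rest of the code points are encoded by reversing the two bytes of the code point, so the low byte comes first.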
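
The byte-reversal rule can be verified by building the expected bytes by hand and comparing them with the encoder's output. This is a minimal sketch, assuming Java 7 or later; Utf16LeDemo, swapBytes, and encode are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16LeDemo {
    // Expected little-endian bytes written by hand: low byte first, high byte second.
    static byte[] swapBytes(int codePoint) {
        return new byte[] { (byte) (codePoint & 0xFF), (byte) (codePoint >> 8) };
    }

    // Encode a single code point in the range 0x0000 - 0xFFFF with UTF-16LE.
    static byte[] encode(int codePoint) {
        return String.valueOf((char) codePoint).getBytes(StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0x00E9, 0xD7FF, 0xE000, 0xFFFF };
        for (int cp : samples) {
            boolean same = Arrays.equals(encode(cp), swapBytes(cp));
            System.out.println(String.format("%04X reversed bytes match: %b", cp, same));
        }
    }
}
```

The samples deliberately avoid 0xD800 - 0xDFFF, where the encoder substitutes the replacement character instead of reversing the bytes.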

Here is the output for UTF-16BE encoding, the big-endian variation of UTF-16 encoding:

Code Point > Byte Sequence - Code Point > Byte Sequence

0000 > 00 00 - 00FF > 00 FF
0100 > 01 00 - 01FF > 01 FF
0200 > 02 00 - 02FF > 02 FF
......
D700 > D7 00 - D7FF > D7 FF
D800 > FF FD - DFFF > FF FD
E000 > E0 00 - E0FF > E0 FF
E100 > E1 00 - E1FF > E1 FF
E200 > E2 00 - E2FF > E2 FF
......
FF00 > FF 00 - FFFF > FF FF

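The encoding map of UTF-16BE is also simple:

  • Within the range shown, 0x0000 - 0xFFFF, the output sequence is a fixed length, 2 bytes, with no leading format flag.
  • It is not backward compatible with US-ASCII.
  • One section of code points, 0xD800 - 0xDFFF, is not valid on its own. These code points are all mapped to 0xFFFD, the Unicode replacement character.
  • The rest of the code points are encoded by copying the two bytes of the code point unchanged, high byte first.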
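
Both UTF-16BE observations, the unchanged byte order and the invalid 0xD800 - 0xDFFF section, can be checked in a few lines. This is a minimal sketch, assuming Java 7 or later; Utf16BeDemo and its methods are hypothetical names. It relies on CharsetEncoder.canEncode(char), which returns false for a surrogate character given on its own.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16BeDemo {
    // Expected big-endian bytes written by hand: high byte first, low byte second.
    static byte[] copyBytes(int codePoint) {
        return new byte[] { (byte) (codePoint >> 8), (byte) (codePoint & 0xFF) };
    }

    // Encode a single code point in the range 0x0000 - 0xFFFF with UTF-16BE.
    static byte[] encode(int codePoint) {
        return String.valueOf((char) codePoint).getBytes(StandardCharsets.UTF_16BE);
    }

    // True if the encoder accepts this character on its own.
    static boolean encodable(int codePoint) {
        CharsetEncoder encoder = StandardCharsets.UTF_16BE.newEncoder();
        return encoder.canEncode((char) codePoint);
    }

    public static void main(String[] args) {
        // A valid code point: the output bytes equal the code point's own bytes.
        System.out.println(Arrays.equals(encode(0xE000), copyBytes(0xE000)));
        // A lone surrogate: rejected by the encoder, replaced by getBytes().
        System.out.println(encodable(0xD800));
        System.out.println(Arrays.equals(encode(0xD800), copyBytes(0xFFFD)));
    }
}
```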

Last update: 2006.

Dr. Herong Yang, updated in 2008