Unicode Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 5.00

UTF-16BE Encoding

This section provides a quick introduction to the UTF-16BE (Unicode Transformation Format - 16-bit Big Endian) encoding for the Unicode character set. UTF-16BE is a variation of UTF-16.

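UTF-16BE: A character encoding that maps each code point of the Unicode character set to a sequence of 2 bytes (one 16-bit unit) or 4 bytes (two 16-bit units, called a surrogate pair), with the most significant byte of each unit stored first. UTF-16BE stands for Unicode Transformation Format - 16-bit Big Endian.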

Here is my understanding of the UTF-16BE specification. When UTF-16BE encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage, the resulting byte stream is identical to the big-endian format without BOM of the UTF-16 encoding.

For example, the 3 Unicode characters U+004D, U+0061 and U+10000 will be converted into 0x004D0061D800DC00 when UTF-16BE is used. U+004D and U+0061 take 2 bytes each, and U+10000 takes 4 bytes as the surrogate pair 0xD800 0xDC00.
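The example above can be checked with a short Python sketch. It assumes Python's built-in "utf-16-be" codec is an acceptable stand-in for the encoding described here; the variable names are only illustrative.

```python
# Code points taken from the example above.
code_points = [0x004D, 0x0061, 0x10000]

# Build a string from the code points and encode it as UTF-16BE.
text = "".join(chr(cp) for cp in code_points)
encoded = text.encode("utf-16-be")

# Show the bytes as an uppercase hex string.
encoded_hex = encoded.hex().upper()
print("0x" + encoded_hex)  # 0x004D0061D800DC00
```

The output has 8 bytes: 2 for each of the first two characters and 4 for U+10000.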

When UTF-16BE encoding is used to decode (deserialize) a byte stream into Unicode characters, the entire stream is divided into blocks of 2 bytes. Each block is converted to a 16-bit integer, assuming the most significant byte comes first. The resulting integer stream is then processed as described below:

  • If a converted integer is not in the surrogate area, i.e. < 0xD800 or > 0xDFFF, it represents the code point of the decoded character.
  • If a converted integer is in the surrogate area, i.e. >= 0xD800 and <= 0xDFFF, it represents the first surrogate of a surrogate pair. In well-formed input the first (high) surrogate is in the 0xD800...0xDBFF range and the second (low) surrogate is in the 0xDC00...0xDFFF range. Take the next converted integer as the second surrogate and convert the surrogate pair to a Unicode character in the U+10000...U+10FFFF range.
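The decoding steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name decode_utf16be is hypothetical, and raising ValueError on odd-length input or on a surrogate that is not part of a valid high/low pair is an assumption added here, since the text does not say how malformed input should be handled.

```python
from typing import List


def decode_utf16be(data: bytes) -> List[int]:
    """Decode UTF-16BE bytes into a list of Unicode code points (sketch)."""
    if len(data) % 2 != 0:
        # Assumption: reject input that cannot be split into 2-byte blocks.
        raise ValueError("UTF-16BE input must have an even number of bytes")

    # Split into 2-byte blocks, most significant byte first.
    units = [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]

    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit < 0xD800 or unit > 0xDFFF:
            # Not in the surrogate area: the integer is the code point.
            code_points.append(unit)
            i += 1
        elif (unit <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: combine the pair.
            low = units[i + 1]
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # Assumption: treat an unpaired surrogate as an error.
            raise ValueError(f"unpaired surrogate 0x{unit:04X} at unit {i}")
    return code_points


decoded = decode_utf16be(bytes.fromhex("004D0061D800DC00"))
print([f"U+{cp:04X}" for cp in decoded])  # ['U+004D', 'U+0061', 'U+10000']
```

The pair is combined with the usual UTF-16 formula: 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00).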

Note that the use of BOM (Byte Order Mark) is not part of the UTF-16BE specification. So you should:

  • Not prepend the BOM sequence, 0xFEFF, to the output byte stream when encoding.
  • Not treat an initial 0xFEFF sequence as a BOM when decoding. If it exists, convert the initial 0xFEFF sequence to a Unicode character, the ZERO WIDTH NO-BREAK SPACE character, U+FEFF.
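The two BOM rules above can be observed with Python's built-in codecs. This sketch assumes that the "utf-16-be" codec follows the UTF-16BE behavior described here and that the generic "utf-16" codec is the BOM-aware variant; the variable names are illustrative.

```python
# Bytes FE FF followed by U+004D in big-endian order.
data = bytes.fromhex("FEFF004D")

# UTF-16BE: the leading FE FF is kept as the character U+FEFF.
as_be = data.decode("utf-16-be")

# Generic UTF-16: the leading FE FF is consumed as a BOM.
as_generic = data.decode("utf-16")

# Encoding with UTF-16BE does not prepend a BOM.
encoded = "M".encode("utf-16-be")

print([f"U+{ord(c):04X}" for c in as_be])       # ['U+FEFF', 'U+004D']
print([f"U+{ord(c):04X}" for c in as_generic])  # ['U+004D']
print(encoded.hex().upper())                    # 004D
```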

Dr. Herong Yang, updated in 2009