Unicode Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 5.00

UTF-32BE Encoding

This section provides a quick introduction to the UTF-32BE (Unicode Transformation Format - 32-bit Big Endian) encoding for the Unicode character set.

UTF-32BE: A character encoding scheme that maps each code point of the Unicode character set to a sequence of 4 bytes (32 bits). UTF-32BE stands for Unicode Transformation Format - 32-bit Big Endian.

Here is my understanding of the UTF-32BE specification. When UTF-32BE encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage, the code point of each character will be converted as a 32-bit integer into 4 bytes with the most significant byte first.

For example, these 3 Unicode characters, U+004D, U+0061 and U+10000, will be converted into the 12-byte sequence 0x0000004D0000006100010000 when UTF-32BE is used: 0x0000004D for U+004D, 0x00000061 for U+0061, and 0x00010000 for U+10000.
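The encoding step described above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name encode_utf32be is hypothetical, and the range check (rejecting surrogates and values above U+10FFFF) is an assumption about what a careful encoder would do. The last line compares the result with Python's built-in "utf-32-be" codec.

```python
import struct

def encode_utf32be(code_points):
    """Serialize an iterable of Unicode code points as UTF-32BE bytes.

    Hypothetical helper for illustration only.
    """
    out = bytearray()
    for cp in code_points:
        # Assumption: reject surrogates and values outside U+0000..U+10FFFF.
        if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        # ">I" packs an unsigned 32-bit integer, most significant byte first.
        out += struct.pack(">I", cp)
    return bytes(out)

encoded = encode_utf32be([0x004D, 0x0061, 0x10000])
print(encoded.hex().upper())  # 0000004D0000006100010000

# The built-in codec produces the same bytes and adds no BOM.
print("Ma\U00010000".encode("utf-32-be") == encoded)  # True
```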

When UTF-32BE encoding is used to decode (deserialize) a byte stream into Unicode characters, the entire stream will be divided into blocks of 4 bytes. Each block is read as a 32-bit integer, most significant byte first, and that integer is the Unicode code point of one character.
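The decoding step can be sketched the same way. Again this is only an illustration under stated assumptions: decode_utf32be is a hypothetical name, and raising an error for a stream whose length is not a multiple of 4, or for a value that is not a valid code point, is my own choice of error handling.

```python
import struct

def decode_utf32be(data):
    """Split UTF-32BE bytes into 4-byte blocks and return the code points.

    Hypothetical helper for illustration only.
    """
    # Assumption: a trailing partial block is treated as an error.
    if len(data) % 4 != 0:
        raise ValueError("UTF-32BE input length must be a multiple of 4 bytes")
    code_points = []
    for offset in range(0, len(data), 4):
        # ">I" reads an unsigned 32-bit integer, most significant byte first.
        (cp,) = struct.unpack_from(">I", data, offset)
        # Assumption: reject surrogates and values above U+10FFFF.
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {cp:#x}")
        code_points.append(cp)
    return code_points

decoded = decode_utf32be(bytes.fromhex("0000004D0000006100010000"))
print([f"U+{cp:04X}" for cp in decoded])  # ['U+004D', 'U+0061', 'U+10000']
```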

Note that the BOM (Byte Order Mark) is not part of the UTF-32BE specification, because the encoding name already fixes the byte order. So you should:

  • Not prepend the BOM sequence, 0x0000FEFF, to the output byte stream when encoding.
  • Not treat an initial 0x0000FEFF sequence as a BOM when decoding. If it exists, convert it into a Unicode character like any other block: the ZERO WIDTH NO-BREAK SPACE character, U+FEFF.
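As a quick check of the two rules above, Python's built-in "utf-32-be" codec can be used. This only shows how that particular codec behaves. It is not a statement about other libraries, and the variable names are arbitrary.

```python
# A stream that starts with 0x0000FEFF, followed by U+004D.
data = bytes.fromhex("0000FEFF0000004D")

# Decoding: the initial 0x0000FEFF is kept as the character U+FEFF.
text = data.decode("utf-32-be")
print([f"U+{ord(ch):04X}" for ch in text])  # ['U+FEFF', 'U+004D']

# Encoding: no BOM is prepended to the output.
no_bom = "M".encode("utf-32-be")
print(no_bom.hex().upper())  # 0000004D
```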

Dr. Herong Yang, updated in 2009