Unicode Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 5.00

UTF-32 Encoding

This section provides a quick introduction to the UTF-32 (Unicode Transformation Format - 32-bit) encoding for the Unicode character set. UTF-32 uses 32 bits, or 4 bytes, to encode each character.

UTF-32: A character encoding scheme that maps each code point of the Unicode character set to a sequence of 4 bytes (32 bits). UTF-32 stands for Unicode Transformation Format - 32-bit.

Here is my understanding of the UTF-32 specification. When UTF-32 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage, there are 3 valid formats to choose from:

  • Big-Endian without BOM Format - Convert the code point of each character as a 32-bit integer into 4 bytes with the most significant byte first.
  • Big-Endian with BOM Format - Prepend 0x0000FEFF, then convert the code point of each character as a 32-bit integer into 4 bytes with the most significant byte first.
  • Little-Endian with BOM Format - Prepend 0xFFFE0000, then convert the code point of each character as a 32-bit integer into 4 bytes with the least significant byte first.

For example, all 3 byte streams listed below are valid UTF-32 encoded streams for the 3 Unicode characters U+004D, U+0061 and U+10000:

  • Big-Endian without BOM Format - 0x0000004D0000006100010000
  • Big-Endian with BOM Format - 0x0000FEFF0000004D0000006100010000
  • Little-Endian with BOM Format - 0xFFFE00004D0000006100000000000100
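The 3 formats above can be reproduced with a few lines of code. Here is a minimal Python sketch of my understanding, not a definitive implementation. The function name utf32_encode and its parameters little_endian and with_bom are hypothetical names I made up for this illustration. Since little-endian without BOM is not one of the 3 formats listed above, the sketch rejects that combination.

```python
import struct

BOM = 0xFEFF  # code point U+FEFF, used as the Byte Order Mark


def utf32_encode(text, little_endian=False, with_bom=True):
    """Serialize text into UTF-32 bytes (hypothetical helper for illustration)."""
    if little_endian and not with_bom:
        # Not one of the 3 formats described in this section.
        raise ValueError("little-endian output requires a BOM here")
    # "<I" packs an unsigned 32-bit integer least significant byte first,
    # ">I" packs it most significant byte first.
    fmt = "<I" if little_endian else ">I"
    code_points = [ord(ch) for ch in text]
    if with_bom:
        code_points.insert(0, BOM)
    return b"".join(struct.pack(fmt, cp) for cp in code_points)


text = "\u004D\u0061\U00010000"  # U+004D, U+0061 and U+10000
print(utf32_encode(text, with_bom=False).hex().upper())
print(utf32_encode(text).hex().upper())
print(utf32_encode(text, little_endian=True).hex().upper())
```

The 3 printed lines should match the 3 example streams listed above, without the leading 0x.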

When UTF-32 encoding is used to decode (deserialize) a byte stream into Unicode characters, the following logic should be used:

  • Step 1 - Read the first 4 bytes.
  • Step 2a - If the first 4 bytes are 0x0000FEFF, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the most significant byte comes first.
  • Step 2b - If the first 4 bytes are 0xFFFE0000, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the least significant byte comes first.
  • Step 2c - If the first 4 bytes are neither 0x0000FEFF nor 0xFFFE0000, convert the entire stream, including the first 4 bytes, in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the most significant byte comes first.
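The decoding steps above can be sketched in Python as follows. This is only an illustration of the logic as I understand it. The function name utf32_decode is a hypothetical name, and the length check is my own assumption about how to handle a truncated stream. The sketch does not validate code points beyond what Python's chr() function rejects.

```python
import struct


def utf32_decode(data):
    """Deserialize UTF-32 bytes into text (hypothetical helper for illustration)."""
    if len(data) % 4 != 0:
        # Assumption: a stream that is not a whole number of 4-byte blocks is invalid.
        raise ValueError("UTF-32 stream length must be a multiple of 4 bytes")

    # Step 1 - Read the first 4 bytes.
    first = data[:4]
    if first == b"\x00\x00\xFE\xFF":
        # Step 2a - Big-endian BOM: skip it, most significant byte first.
        fmt, start = ">I", 4
    elif first == b"\xFF\xFE\x00\x00":
        # Step 2b - Little-endian BOM: skip it, least significant byte first.
        fmt, start = "<I", 4
    else:
        # Step 2c - No BOM: decode everything, most significant byte first.
        fmt, start = ">I", 0

    chars = []
    for i in range(start, len(data), 4):
        (code_point,) = struct.unpack(fmt, data[i:i + 4])
        chars.append(chr(code_point))  # raises ValueError above U+10FFFF
    return "".join(chars)


stream = bytes.fromhex("FFFE00004D0000006100000000000100")
print([f"U+{ord(ch):04X}" for ch in utf32_decode(stream)])
```

Feeding any of the 3 example streams from earlier in this section to this function should give back the same 3 characters: U+004D, U+0061 and U+10000.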

As of today, July 2009, not many applications support UTF-32 encoding. The only one I have found on my Windows system is Firefox 3.0.11.

Sections in This Chapter

UTF-32 Encoding

UTF-32BE Encoding

UTF-32LE Encoding

Dr. Herong Yang, updated in 2009