UTF-32 Encoding

Unicode Tutorials - Herong's Tutorial Examples

This section provides a quick introduction to the UTF-32 (Unicode Transformation Format - 32-bit) encoding for the Unicode character set. UTF-32 uses 32 bits, or 4 bytes, to encode each character.

UTF-32: A character encoding scheme that maps each code point of the Unicode character set to a sequence of 4 bytes (32 bits). UTF-32 stands for Unicode Transformation Format - 32-bit.

Here is my understanding of the UTF-32 specification. When UTF-32 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage, any of the following 3 formats is valid:

Big-Endian without BOM Format - Convert the code point of each character as a 32-bit integer into 4 bytes with the most significant byte first.
Big-Endian with BOM Format - Prepend 0x0000FEFF, then convert the code point of each character as a 32-bit integer into 4 bytes with the most significant byte first.
Little-Endian with BOM Format - Prepend 0xFFFE0000, then convert the code point of each character as a 32-bit integer into 4 bytes with the least significant byte first.

For example, all 3 byte streams listed below are valid UTF-32 encodings of the 3 Unicode characters U+004D, U+0061 and U+10000:

Big-Endian without BOM Format - 0x0000004D0000006100010000
Big-Endian with BOM Format - 0x0000FEFF0000004D0000006100010000
Little-Endian with BOM Format - 0xFFFE00004D0000006100000000000100
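The 3 streams above can be reproduced with a short Python sketch. This is only my illustration of the formats as described here, not a reference implementation: the function name utf32_encode and its parameters byte_order and with_bom are hypothetical, and the sketch assumes the input contains only valid Unicode scalar values.

```python
def utf32_encode(text, byte_order="big", with_bom=False):
    """Serialize text as UTF-32 in one of the 3 formats listed above.

    utf32_encode, byte_order and with_bom are hypothetical names
    chosen for this sketch.
    """
    if byte_order not in ("big", "little"):
        raise ValueError("byte_order must be 'big' or 'little'")
    if byte_order == "little" and not with_bom:
        # Little-endian without a BOM is not one of the 3 formats above.
        raise ValueError("little-endian output requires a BOM here")

    out = bytearray()
    if with_bom:
        # U+FEFF becomes 00 00 FE FF (big) or FF FE 00 00 (little).
        out += (0xFEFF).to_bytes(4, byte_order)
    for ch in text:
        # Each code point is written as one 32-bit integer (4 bytes).
        out += ord(ch).to_bytes(4, byte_order)
    return bytes(out)


sample = "\u004D\u0061\U00010000"

print(utf32_encode(sample).hex().upper())
# 0000004D0000006100010000
print(utf32_encode(sample, with_bom=True).hex().upper())
# 0000FEFF0000004D0000006100010000
print(utf32_encode(sample, "little", with_bom=True).hex().upper())
# FFFE00004D0000006100000000000100
```

Python's built-in "utf-32-be" codec produces the same bytes as the first format, which is a convenient cross-check.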

When UTF-32 encoding is used to decode (deserialize) a byte stream into Unicode characters, the following logic should be used:

Step 1 - Read the first 4 bytes.
Step 2a - If the first 4 bytes are 0x0000FEFF, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the most significant byte comes first.
Step 2b - If the first 4 bytes are 0xFFFE0000, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the least significant byte comes first.
Step 2c - If the first 4 bytes are neither 0x0000FEFF nor 0xFFFE0000, convert the entire stream, including the first 4 bytes, in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the most significant byte comes first.
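The decoding steps above can be sketched in Python as follows. Again, this is a minimal illustration under my own assumptions: the function name utf32_decode is hypothetical, and the check that the stream length is a multiple of 4 is my addition, not part of the steps above.

```python
def utf32_decode(data):
    """Decode a UTF-32 byte stream following Steps 1, 2a, 2b and 2c.

    utf32_decode is a hypothetical name chosen for this sketch.
    """
    # Assumption: reject streams that cannot be split into 4-byte blocks.
    if len(data) % 4 != 0:
        raise ValueError("stream length must be a multiple of 4 bytes")

    # Step 1: read the first 4 bytes.
    head = data[:4]
    if head == b"\x00\x00\xFE\xFF":
        # Step 2a: big-endian BOM, skip it.
        byte_order, start = "big", 4
    elif head == b"\xFF\xFE\x00\x00":
        # Step 2b: little-endian BOM, skip it.
        byte_order, start = "little", 4
    else:
        # Step 2c: no BOM, decode everything as big-endian.
        byte_order, start = "big", 0

    chars = []
    for i in range(start, len(data), 4):
        code_point = int.from_bytes(data[i:i + 4], byte_order)
        # chr() raises ValueError for values above 0x10FFFF.
        chars.append(chr(code_point))
    return "".join(chars)


streams = [
    "0000004D0000006100010000",
    "0000FEFF0000004D0000006100010000",
    "FFFE00004D0000006100000000000100",
]
for hex_stream in streams:
    decoded = utf32_decode(bytes.fromhex(hex_stream))
    print([f"U+{ord(ch):04X}" for ch in decoded])
# Each line prints ['U+004D', 'U+0061', 'U+10000']
```

All 3 example streams decode to the same 3 characters, which is the point of the BOM detection in Steps 2a and 2b.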

As of July 2009, few applications support UTF-32 encoding. On my Windows system, Firefox 3.0.11 is the only one I have found that supports it.