This section provides a quick introduction to the UTF-32 (Unicode Transformation Format - 32-bit) encoding for the Unicode character set. UTF-32 uses 32 bits, or 4 bytes, to encode each character.
UTF-32: A character encoding scheme that maps each code point of the Unicode character
set to a sequence of 4 bytes (32 bits). UTF-32 stands for Unicode Transformation
Format - 32-bit.
Here is my understanding of the UTF-32 specification.
When UTF-32 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage,
any of the following 3 formats is valid:
Big-Endian without BOM Format - Convert the code point of each character as a 32-bit integer into 4 bytes with the most significant byte first.
Big-Endian with BOM Format - Prepend 0x0000FEFF, then convert the code point of each character as a 32-bit integer into 4 bytes with the most significant byte first.
Little-Endian with BOM Format - Prepend 0xFFFE0000, then convert the code point of each character as a 32-bit integer into 4 bytes with the least significant byte first.
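The 3 formats above can be sketched in a few lines of Python. This is only a minimal illustration of my reading of the rules, not a reference implementation: the function name utf32_encode and its two parameters are hypothetical, and the function rejects the little-endian without BOM combination simply because it is not one of the 3 formats listed above.

```python
import struct

BOM = 0xFEFF  # code point U+FEFF, used as the Byte Order Mark


def utf32_encode(text, little_endian=False, with_bom=True):
    """Hypothetical helper: serialize text in one of the 3 formats above."""
    if little_endian and not with_bom:
        # Not one of the 3 formats listed above, so this sketch refuses it.
        raise ValueError("little-endian output requires a BOM in this sketch")

    # "<I" = 32-bit unsigned integer, least significant byte first;
    # ">I" = 32-bit unsigned integer, most significant byte first.
    fmt = "<I" if little_endian else ">I"

    code_points = [ord(ch) for ch in text]
    if with_bom:
        # Packing U+FEFF with the chosen byte order yields
        # 0x0000FEFF (big-endian) or 0xFFFE0000 (little-endian).
        code_points.insert(0, BOM)

    return b"".join(struct.pack(fmt, cp) for cp in code_points)
```

For instance, utf32_encode("M", with_bom=False) gives the 4 bytes 0x0000004D, while utf32_encode("M", little_endian=True) gives 0xFFFE0000 followed by 0x4D000000.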
For example, all 3 byte streams listed below are valid UTF-32 encoded streams
for 3 Unicode characters, U+004D, U+0061 and U+10000:
Big-Endian without BOM Format - 0x0000004D0000006100010000
Big-Endian with BOM Format - 0x0000FEFF0000004D0000006100010000
Little-Endian with BOM Format - 0xFFFE00004D0000006100000000000100
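The 3 streams above can be reproduced with Python's built-in codecs, as a quick cross-check. The variable names here are my own. The byte order is spelled out explicitly ("utf-32-be" and "utf-32-le", plus the BOM constants from the codecs module) because the plain "utf-32" codec writes its BOM in the byte order of the machine it runs on.

```python
import codecs

# The 3 characters from the example: U+004D, U+0061 and U+10000.
text = "\u004D\u0061\U00010000"

# Big-Endian without BOM: just the 4-byte big-endian code points.
big_endian = text.encode("utf-32-be")

# Big-Endian with BOM: 0x0000FEFF followed by the same bytes.
big_endian_bom = codecs.BOM_UTF32_BE + text.encode("utf-32-be")

# Little-Endian with BOM: 0xFFFE0000 followed by little-endian code points.
little_endian_bom = codecs.BOM_UTF32_LE + text.encode("utf-32-le")

for stream in (big_endian, big_endian_bom, little_endian_bom):
    print("0x" + stream.hex().upper())
```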
When UTF-32 encoding is used to decode (deserialize) a byte stream into Unicode characters, the following logic should be used:
Step 1 - Read the first 4 bytes.
Step 2a - If the first 4 bytes are 0x0000FEFF, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream
in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the most significant byte comes first.
Step 2b - If the first 4 bytes are 0xFFFE0000, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream
in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the least significant byte comes first.
Step 2c - If the first 4 bytes are neither 0x0000FEFF nor 0xFFFE0000, convert the entire stream, including the first 4 bytes,
in blocks of 4 bytes. Each block is converted to a 32-bit integer representing a Unicode code point, assuming the most significant byte comes first.
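The decoding steps above can be sketched in Python as follows. The function name utf32_decode is hypothetical, and this is a minimal illustration under my reading of the steps: the length check is my own addition, and code points in the surrogate range are not rejected, although values above 0x10FFFF will raise an error from chr().

```python
import struct


def utf32_decode(data):
    """Hypothetical helper: decode a UTF-32 byte stream per steps 1 to 2c."""
    if len(data) % 4 != 0:
        # Assumption of this sketch: a partial trailing block is an error.
        raise ValueError("UTF-32 stream length must be a multiple of 4 bytes")

    # Step 1 - read the first 4 bytes.
    first = data[:4]

    if first == b"\x00\x00\xfe\xff":
        # Step 2a - big-endian BOM: skip it, most significant byte first.
        fmt, body = ">I", data[4:]
    elif first == b"\xff\xfe\x00\x00":
        # Step 2b - little-endian BOM: skip it, least significant byte first.
        fmt, body = "<I", data[4:]
    else:
        # Step 2c - no BOM: decode everything, most significant byte first.
        fmt, body = ">I", data

    chars = []
    for i in range(0, len(body), 4):
        (code_point,) = struct.unpack(fmt, body[i:i + 4])
        chars.append(chr(code_point))
    return "".join(chars)
```

Feeding it any of the 3 example streams for U+004D, U+0061 and U+10000 should give back the same 3 characters.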
As of July 2009, few applications support UTF-32 encoding. On my Windows system,
Firefox 3.0.11 is the only application I have found that supports it.