UTF-16 Encoding

Unicode Tutorials - Herong's Tutorial Examples

This section provides a quick introduction to the UTF-16 (Unicode Transformation Format - 16-bit) encoding for the Unicode character set. Surrogate pairs are used for characters in the U+10000...U+10FFFF range.

Now that we have learned how to convert Unicode code points in the U+10000...U+10FFFF range into surrogate pairs, we are ready to learn how UTF-16 encoding works.
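
As a quick reminder, the surrogate pair conversion can be sketched in Python as shown below. This is only an illustration, not part of any specification text; the function name to_surrogate_pair is a hypothetical name chosen for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in U+10000...U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the U+10000...U+10FFFF range")
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the first surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the second surrogate
    return high, low


print([hex(s) for s in to_surrogate_pair(0x10000)])   # ['0xd800', '0xdc00']
```

The first surrogate always falls in the 0xD800...0xDBFF range and the second one in the 0xDC00...0xDFFF range.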

UTF-16: A character encoding that maps each code point of the Unicode character set to a sequence of one or two 16-bit code units (2 or 4 bytes). UTF-16 stands for Unicode Transformation Format - 16-bit.

Here is my understanding of the UTF-16 specification. When UTF-16 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage, there are 3 valid formats to choose from:

Big-Endian without BOM Format - If the character is in the U+0000...U+FFFF range (the surrogate area, U+D800...U+DFFF, contains no characters), convert the code point as an unsigned 16-bit integer into 2 bytes with the most significant byte first. If the character is in the U+10000...U+10FFFF range, convert the character into a surrogate pair, then convert each surrogate into 2 bytes with the most significant byte first.
Big-Endian with BOM Format - Prepend the BOM (Byte Order Mark) as the 2 bytes 0xFEFF first. Then convert each character in the same way as the Big-Endian without BOM Format.
Little-Endian with BOM Format - Prepend the BOM as the 2 bytes 0xFFFE first, which is the same U+FEFF value written with the least significant byte first. Then convert each character in the same way as the Big-Endian without BOM Format except that 16-bit integers are converted into 2 bytes with the least significant byte first.
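
The 3 formats above can be sketched in Python as follows. This is a minimal illustration of my reading of the rules, not a reference implementation; the function name utf16_encode and the format labels "be", "be-bom" and "le-bom" are hypothetical names invented for this example.

```python
def utf16_encode(text: str, fmt: str = "be") -> bytes:
    """Encode text as UTF-16; fmt is 'be' (no BOM), 'be-bom' or 'le-bom'."""
    if fmt not in ("be", "be-bom", "le-bom"):
        raise ValueError("unknown format: " + fmt)
    byteorder = "little" if fmt == "le-bom" else "big"

    units = []                               # 16-bit integers to be serialized
    if fmt != "be":
        units.append(0xFEFF)                 # BOM, serialized like any other unit
    for ch in text:
        cp = ord(ch)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code points are not characters")
        if cp < 0x10000:
            units.append(cp)                 # one 16-bit integer
        else:
            offset = cp - 0x10000
            units.append(0xD800 + (offset >> 10))    # first surrogate
            units.append(0xDC00 + (offset & 0x3FF))  # second surrogate

    return b"".join(u.to_bytes(2, byteorder) for u in units)


print(utf16_encode("Ma\U00010000", "le-bom").hex().upper())   # FFFE4D00610000D800DC
```

Note that the BOM is handled as the 16-bit integer 0xFEFF in all cases. It only shows up as the bytes 0xFFFE in the Little-Endian format because its least significant byte is written first.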

For example, all 3 byte streams listed below are valid UTF-16 encoded streams for 3 Unicode characters, U+004D, U+0061 and U+10000:

Big-Endian without BOM Format - 0x004D0061D800DC00
Big-Endian with BOM Format - 0xFEFF004D0061D800DC00
Little-Endian with BOM Format - 0xFFFE4D00610000D800DC
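
These byte streams can be cross-checked with Python's built-in "utf-16-be" and "utf-16-le" codecs, which write no BOM on their own. In this sketch the BOM bytes are prepended by hand, and the variable names are only for illustration.

```python
text = "Ma\U00010000"                       # U+004D, U+0061 and U+10000

be = text.encode("utf-16-be")               # Big-Endian without BOM
be_bom = b"\xfe\xff" + be                   # Big-Endian with BOM
le_bom = b"\xff\xfe" + text.encode("utf-16-le")   # Little-Endian with BOM

print(be.hex().upper())                     # 004D0061D800DC00
print(be_bom.hex().upper())                 # FEFF004D0061D800DC00
print(le_bom.hex().upper())                 # FFFE4D00610000D800DC

# The generic "utf-16" codec reads the BOM to pick the byte order.
print(be_bom.decode("utf-16") == text)      # True
print(le_bom.decode("utf-16") == text)      # True
```

The stream without BOM is not decoded with the generic "utf-16" codec here, because Python falls back to the platform's native byte order when no BOM is present, which may not be Big-Endian.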

When UTF-16 encoding is used to decode (deserialize) a byte stream into Unicode characters, the following logic should be used:

Step 1 - Read the first 2 bytes.
Step 2a - If the first 2 bytes are 0xFEFF, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream in blocks of 2 bytes. Each block is converted to a 16-bit integer assuming the most significant byte first. Then process the converted integer stream according to Step 3a and 3b.
Step 2b - If the first 2 bytes are 0xFFFE, treat them as the BOM (Byte Order Mark), and convert the rest of the byte stream in blocks of 2 bytes. Each block is converted to a 16-bit integer assuming the least significant byte first. Then process the converted integer stream according to Step 3a and 3b.
Step 2c - If the first 2 bytes are neither 0xFEFF nor 0xFFFE, convert the entire stream, including the first 2 bytes, in blocks of 2 bytes. Each block is converted to a 16-bit integer assuming the most significant byte first. Then process the converted integer stream according to Step 3a and 3b.
Step 3a - If a converted integer is not in the surrogate area, i.e. < 0xD800 or > 0xDFFF, it represents the code point of the decoded character.
Step 3b - If a converted integer is in the surrogate area, i.e. >= 0xD800 and <= 0xDFFF, it should be the first surrogate of a surrogate pair, in the 0xD800...0xDBFF range. Take the next converted integer, which should be in the 0xDC00...0xDFFF range, as the second surrogate and convert the surrogate pair to a Unicode character in the U+10000...U+10FFFF range. If either surrogate is out of its range or missing, the stream is not well-formed UTF-16.
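
The decoding steps above can be sketched in Python as follows. This is a minimal illustration of the logic, not a production decoder; the function name utf16_decode is hypothetical, and raising ValueError on an odd byte count or an unpaired surrogate is my own choice for this example (real decoders may substitute a replacement character instead).

```python
def utf16_decode(data: bytes) -> str:
    """Decode a UTF-16 byte stream, using the BOM if present, else big-endian."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must have an even number of bytes")

    # Steps 1 and 2: inspect the first 2 bytes and pick the byte order.
    byteorder, start = "big", 0              # Step 2c: no BOM, keep all bytes
    if data[:2] == b"\xfe\xff":
        start = 2                            # Step 2a: big-endian BOM, skip it
    elif data[:2] == b"\xff\xfe":
        byteorder, start = "little", 2       # Step 2b: little-endian BOM, skip it

    units = [int.from_bytes(data[i:i + 2], byteorder)
             for i in range(start, len(data), 2)]

    # Step 3: turn 16-bit integers into characters.
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit < 0xD800 or unit > 0xDFFF:   # Step 3a: not a surrogate
            chars.append(chr(unit))
            i += 1
            continue
        # Step 3b: must be a first surrogate followed by a second surrogate.
        if unit > 0xDBFF or i + 1 >= len(units) \
                or not 0xDC00 <= units[i + 1] <= 0xDFFF:
            raise ValueError("unpaired surrogate at 16-bit unit %d" % i)
        cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
        chars.append(chr(cp))
        i += 2
    return "".join(chars)


decoded = utf16_decode(bytes.fromhex("FFFE4D00610000D800DC"))
print([hex(ord(c)) for c in decoded])        # ['0x4d', '0x61', '0x10000']
```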