This section provides a quick introduction to the UTF-16 (Unicode Transformation Format - 16-bit) encoding for the Unicode character set. Paired surrogates are used for characters in the U+10000...U+10FFFF range.
Now that we have learned how to convert Unicode code points in the U+10000...U+10FFFF range into surrogate pairs,
we are ready to learn how UTF-16 encoding works.
UTF-16: A character encoding that maps each code point of the Unicode character
set to a sequence of one or two 16-bit code units (2 or 4 bytes). UTF-16 stands for Unicode Transformation Format - 16-bit.
Here is my understanding of the UTF-16 specification.
When UTF-16 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage,
there are 3 valid formats:
Big-Endian without BOM Format - If the character is in the U+0000...U+FFFF range (excluding the surrogate area U+D800...U+DFFF), convert the code point as an unsigned 16-bit integer
into 2 bytes with the most significant byte first. If the character is in the U+10000...U+10FFFF range, convert the character into a surrogate pair,
then convert each surrogate into 2 bytes with the most significant byte first.
Big-Endian with BOM Format - Prepend 0xFEFF first. Then convert each character in the same way as the Big-Endian without BOM Format.
Little-Endian with BOM Format - Prepend the 2 bytes 0xFFFE first, which is the BOM U+FEFF written with the least significant byte first. Then convert each character in the same way as the Big-Endian without BOM Format
except that 16-bit integers are converted into 2 bytes with the least significant byte first.
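The three formats above can be sketched in a few lines of Python. This is only an illustration of my reading of the rules, not a reference implementation: the function names to_code_units and utf16_encode and the parameters little_endian and bom are hypothetical, and raising ValueError on invalid code points is my own assumption.

```python
def to_code_units(cp):
    """Return the one or two 16-bit integers that represent a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not characters")
    if cp <= 0xFFFF:
        return [cp]                      # one 16-bit integer
    v = cp - 0x10000                     # 20-bit value split into 10 + 10 bits
    return [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]


def utf16_encode(code_points, little_endian=False, bom=False):
    """Serialize code points into bytes (hypothetical helper).

    Note: little_endian=True with bom=False is not one of the
    3 formats described in this section.
    """
    byte_order = "little" if little_endian else "big"
    out = bytearray()
    if bom:
        # U+FEFF becomes FE FF (big-endian) or FF FE (little-endian)
        out += (0xFEFF).to_bytes(2, byte_order)
    for cp in code_points:
        for unit in to_code_units(cp):
            out += unit.to_bytes(2, byte_order)
    return bytes(out)


chars = [0x004D, 0x0061, 0x10000]
print(utf16_encode(chars).hex())                                # 004d0061d800dc00
print(utf16_encode(chars, bom=True).hex())                      # feff004d0061d800dc00
print(utf16_encode(chars, little_endian=True, bom=True).hex())  # fffe4d00610000d800dc
```

Splitting the work into "code point to 16-bit integers" and "16-bit integers to bytes" mirrors the two stages described in the formats: the surrogate pair conversion does not depend on byte order, only the final serialization does.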
For example, all 3 encoded streams listed below are valid UTF-16 encoded streams
for 3 Unicode characters, U+004D, U+0061 and U+10000:
Big-Endian without BOM Format - 0x004D0061D800DC00
Big-Endian with BOM Format - 0xFEFF004D0061D800DC00
Little-Endian with BOM Format - 0xFFFE4D00610000D800DC
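These 3 streams can be cross-checked with Python's standard codecs. The sketch below assumes that the "utf-16-be" and "utf-16-le" codecs write no BOM of their own (which is how they behave in Python 3), so the BOM constants from the codecs module are prepended by hand.

```python
import codecs

# U+004D, U+0061 and U+10000 as a Python string
text = "\u004D\u0061\U00010000"

big_endian = text.encode("utf-16-be")
big_endian_bom = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
little_endian_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(big_endian.hex())         # 004d0061d800dc00
print(big_endian_bom.hex())     # feff004d0061d800dc00
print(little_endian_bom.hex())  # fffe4d00610000d800dc
```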
When UTF-16 encoding is used to decode (deserialize) a byte stream into Unicode characters, the following logic should be used:
Step 1 - Read the first 2 bytes.
Step 2a - If the first 2 bytes are 0xFEFF, treat them as BOM (Byte Order Mark), and convert the rest of the byte stream
in blocks of 2 bytes. Each block is converted to a 16-bit integer assuming the most significant byte first.
Then process the converted integer stream according to Step 3a and 3b.
Step 2b - If the first 2 bytes are 0xFFFE, treat them as BOM (Byte Order Mark), and convert the rest of the byte stream
in blocks of 2 bytes. Each block is converted to a 16-bit integer assuming the least significant byte first.
Then process the converted integer stream according to Step 3a and 3b.
Step 2c - If the first 2 bytes are neither 0xFEFF nor 0xFFFE, convert the entire stream, including the first 2 bytes,
in blocks of 2 bytes. Each block is converted to a 16-bit integer assuming the most significant byte first.
Then process the converted integer stream according to Step 3a and 3b.
Step 3a - If a converted integer is not in the surrogate area, i.e. < 0xD800 or > 0xDFFF, it represents the code
point of the decoded character.
Step 3b - If a converted integer is in the surrogate area, i.e. >= 0xD800 and <= 0xDFFF, it represents
the first surrogate of a surrogate pair. Take the next converted integer as the second surrogate and convert the surrogate
pair to a Unicode character in the U+10000...U+10FFFF range. In a well-formed stream, the first surrogate is in the
0xD800...0xDBFF range and the second surrogate is in the 0xDC00...0xDFFF range.
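The decoding steps above can be sketched in Python as follows. The function name utf16_decode is hypothetical, and the choice to raise ValueError for an odd number of bytes or an unpaired surrogate is my own assumption; the steps themselves do not say what to do with malformed input.

```python
def utf16_decode(data):
    """Decode UTF-16 bytes into a list of code points (hypothetical helper)."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must have an even number of bytes")

    # Steps 1 and 2: look at the first 2 bytes to pick the byte order
    byte_order, start = "big", 0          # Step 2c: no BOM, assume big-endian
    if data[:2] == b"\xFE\xFF":           # Step 2a: big-endian BOM
        start = 2
    elif data[:2] == b"\xFF\xFE":         # Step 2b: little-endian BOM
        byte_order, start = "little", 2

    # Convert the stream into 16-bit integers, 2 bytes at a time
    units = [int.from_bytes(data[i:i + 2], byte_order)
             for i in range(start, len(data), 2)]

    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        if unit < 0xD800 or unit > 0xDFFF:
            # Step 3a: the integer is the code point itself
            code_points.append(unit)
            i += 1
        elif (unit <= 0xDBFF and i + 1 < len(units)
              and 0xDC00 <= units[i + 1] <= 0xDFFF):
            # Step 3b: combine first and second surrogates
            high, low = unit - 0xD800, units[i + 1] - 0xDC00
            code_points.append(0x10000 + (high << 10) + low)
            i += 2
        else:
            raise ValueError("unpaired surrogate at 16-bit unit %d" % i)
    return code_points


for stream in ("004D0061D800DC00",
               "FEFF004D0061D800DC00",
               "FFFE4D00610000D800DC"):
    decoded = utf16_decode(bytes.fromhex(stream))
    print([hex(cp) for cp in decoded])    # ['0x4d', '0x61', '0x10000']
```

All 3 example streams from the encoding part decode back to the same 3 code points, which is a quick sanity check that the byte order handling in Step 2 matches the 3 formats.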