This section provides a quick introduction to the UTF-8 (Unicode Transformation Format - 8-bit) encoding for the Unicode character set. It uses 1, 2, 3, or 4 bytes for each character.
UTF-8: A character encoding that maps each code point of the Unicode character
set to a sequence of 1 to 4 bytes (8 bits each). UTF-8 stands for Unicode Transformation Format - 8-bit.
Here is my understanding of the UTF-8 specification.
When UTF-8 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage,
the following logic should be used:
If a code point is in the U+0000...U+007F range, it can be viewed as a 7-bit integer, 0bxxxxxxx. Map the code point into
1 byte with the high-order bit set to 0: B1 = 0b0xxxxxxx.
If a code point is in the U+0080...U+07FF range, it can be viewed as an 11-bit integer, 0byyyyyxxxxxx.
Map the code point into 2 bytes with the first 5 bits stored in the first byte and the last 6 bits in the second byte:
B1 = 0b110yyyyy, B2 = 0b10xxxxxx.
If a code point is in the U+0800...U+FFFF range, it can be viewed as a 16-bit integer, 0bzzzzyyyyyyxxxxxx.
Map the code point into 3 bytes with the first 4 bits stored in the first byte, the next 6 bits in the second byte,
and the last 6 bits in the third byte:
B1 = 0b1110zzzz, B2 = 0b10yyyyyy, B3 = 0b10xxxxxx.
If a code point is in the U+10000...U+10FFFF range, it can be viewed as a 21-bit integer, 0bvvvzzzzzzyyyyyyxxxxxx.
Map the code point into 4 bytes with the first 3 bits stored in the first byte, the next 6 bits in the second byte,
another 6 bits in the third byte, and the last 6 bits in the fourth byte:
B1 = 0b11110vvv, B2 = 0b10zzzzzz, B3 = 0b10yyyyyy, B4 = 0b10xxxxxx.
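The four rules above can be sketched in Python as follows. This is only an illustration of my reading of the rules, not a reference implementation. The function name utf8_encode is hypothetical, and the rejection of surrogate code points (U+D800...U+DFFF) is an extra check that is not part of the rules listed above.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption added here: surrogates are reserved for UTF-16
        # and have no UTF-8 form.
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:
        # 1 byte: 0b0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 2 bytes: 0b110yyyyy, 0b10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 3 bytes: 0b1110zzzz, 0b10yyyyyy, 0b10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 4 bytes: 0b11110vvv, 0b10zzzzzz, 0b10yyyyyy, 0b10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])
```

Each branch shifts the code point right to bring the wanted bit group into the low positions, masks it to 6 bits where needed, and then sets the fixed prefix bits with a bitwise OR.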
The above logic can also be summarized in a table like this:
Code Point Range      Binary Format             Byte 1      Byte 2      Byte 3      Byte 4
U+000000...U+00007F   0bxxxxxxx                 0b0xxxxxxx
U+000080...U+0007FF   0byyyyyxxxxxx             0b110yyyyy  0b10xxxxxx
U+000800...U+00FFFF   0bzzzzyyyyyyxxxxxx        0b1110zzzz  0b10yyyyyy  0b10xxxxxx
U+010000...U+10FFFF   0bvvvzzzzzzyyyyyyxxxxxx   0b11110vvv  0b10zzzzzz  0b10yyyyyy  0b10xxxxxx
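The table can also be read in the other direction: the prefix bits of the first byte tell how many bytes the sequence has. Here is a small Python sketch of that idea. The function name utf8_sequence_length is hypothetical, and the function only inspects the prefix pattern shown in the Byte 1 column. It does not validate the following bytes or check anything else about the sequence.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the UTF-8 sequence length implied by the first byte's prefix bits."""
    if first_byte >> 7 == 0b0:
        return 1  # 0b0xxxxxxx
    if first_byte >> 5 == 0b110:
        return 2  # 0b110yyyyy
    if first_byte >> 4 == 0b1110:
        return 3  # 0b1110zzzz
    if first_byte >> 3 == 0b11110:
        return 4  # 0b11110vvv
    # 0b10xxxxxx is a continuation byte; other patterns are not in the table.
    raise ValueError("not a valid first byte pattern")
```

For example, utf8_sequence_length(0xF0) returns 4, because 0xF0 is 0b11110000 and matches the 0b11110vvv pattern.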
For example, these 3 Unicode characters, U+004D, U+0061 and U+10000, will be
converted into the 6-byte sequence 0x4D61F0908080 when UTF-8 is used: 0x4D for U+004D, 0x61 for U+0061, and 0xF0908080 for U+10000.
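This example can be checked with Python's built-in UTF-8 encoder. The variable names below are only for illustration.

```python
# The 3 code points from the example above.
code_points = [0x004D, 0x0061, 0x10000]

# Build a string from the code points and encode it with the built-in codec.
text = "".join(chr(cp) for cp in code_points)
encoded = text.encode("utf-8")

print(encoded.hex().upper())  # prints 4D61F0908080
```

The first two characters take 1 byte each and the third takes 4 bytes, giving 6 bytes in total.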