Unicode Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 5.00

UTF-8 Encoding

This section provides a quick introduction to the UTF-8 (Unicode Transformation Format - 8-bit) encoding for the Unicode character set. It uses 1, 2, 3, or 4 bytes for each character.

UTF-8: A character encoding that maps each code point of the Unicode character set to a sequence of 1 to 4 bytes (8 bits each). UTF-8 stands for Unicode Transformation Format - 8-bit.

Here is my understanding of the UTF-8 specification. When UTF-8 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage, the following logic should be used:

  • If a code point is in the U+0000...U+007F range, it can be viewed as a 7-bit integer, 0bxxxxxxx. Map the code point into 1 byte with the high order bit set to 0: B1 = 0b0xxxxxxx.
  • If a code point is in the U+0080...U+07FF range, it can be viewed as an 11-bit integer, 0byyyyyxxxxxx. Map the code point into 2 bytes with the first 5 bits stored in the first byte and the last 6 bits in the second byte: B1 = 0b110yyyyy, B2 = 0b10xxxxxx.
  • If a code point is in the U+0800...U+FFFF range (excluding the surrogate range U+D800...U+DFFF, which is not valid in UTF-8), it can be viewed as a 16-bit integer, 0bzzzzyyyyyyxxxxxx. Map the code point into 3 bytes with the first 4 bits stored in the first byte, the next 6 bits in the second byte, and the last 6 bits in the third byte: B1 = 0b1110zzzz, B2 = 0b10yyyyyy, B3 = 0b10xxxxxx.
  • If a code point is in the U+10000...U+10FFFF range, it can be viewed as a 21-bit integer, 0bvvvzzzzzzyyyyyyxxxxxx. Map the code point into 4 bytes with the first 3 bits stored in the first byte, the next 6 bits in the second byte, another 6 bits in the third byte, and the last 6 bits in the fourth byte: B1 = 0b11110vvv, B2 = 0b10zzzzzz, B3 = 0b10yyyyyy, B4 = 0b10xxxxxx.
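
The four rules above can be sketched in Python as follows. This is only a minimal illustration of my reading of the rules, not a reference implementation. The function name utf8_encode is hypothetical, and the rejection of surrogate code points (U+D800...U+DFFF) is an extra check I have added on the assumption that strict UTF-8 does not allow them.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point is outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption: surrogate code points are rejected, as in strict UTF-8.
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:
        # 7 bits -> 0b0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 11 bits -> 0b110yyyyy, 0b10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:
        # 16 bits -> 0b1110zzzz, 0b10yyyyyy, 0b10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 21 bits -> 0b11110vvv, 0b10zzzzzz, 0b10yyyyyy, 0b10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])
```

Each branch shifts the code point right to bring the wanted bit group into the low positions, masks it to 6 bits where needed, and ORs in the fixed prefix bits of that byte.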

The above logic can also be summarized in a table like this:

                      Binary Format and Split Bytes
Code Point Range      Byte 1      Byte 2      Byte 3      Byte 4

U+000000...U+00007F   0bxxxxxxx
                      0b0xxxxxxx

U+000080...U+0007FF   0byyyyyxxxxxx
                      0b110yyyyy  0b10xxxxxx

U+000800...U+00FFFF   0bzzzzyyyyyyxxxxxx
                      0b1110zzzz  0b10yyyyyy  0b10xxxxxx

U+010000...U+10FFFF   0bvvvzzzzzzyyyyyyxxxxxx
                      0b11110vvv  0b10zzzzzz  0b10yyyyyy  0b10xxxxxx
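
One consequence of the table is that the first byte alone tells a reader how many bytes the sequence has, because each row starts with a different bit prefix. The sketch below illustrates that reading in Python. The function name utf8_sequence_length is hypothetical, and the check only looks at the prefix bits; it does not reject first bytes that strict UTF-8 forbids for other reasons.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by the prefix bits of the first byte."""
    if (first_byte & 0b10000000) == 0b00000000:
        return 1  # 0b0xxxxxxx
    if (first_byte & 0b11100000) == 0b11000000:
        return 2  # 0b110yyyyy
    if (first_byte & 0b11110000) == 0b11100000:
        return 3  # 0b1110zzzz
    if (first_byte & 0b11111000) == 0b11110000:
        return 4  # 0b11110vvv
    # 0b10xxxxxx is a continuation byte, so it cannot start a sequence.
    raise ValueError("not a UTF-8 first byte")
```

For instance, utf8_sequence_length(0xF0) returns 4, which matches the last row of the table.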

For example, the 3 Unicode characters U+004D, U+0061 and U+10000 will be converted into the 6-byte sequence 0x4D61F0908080 when UTF-8 is used: 0x4D for U+004D, 0x61 for U+0061, and 0xF0908080 for U+10000.
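
This example can be checked with Python's built-in UTF-8 codec. The short script below is only a verification sketch; the variable names are my own choice.

```python
# The three characters from the example: U+004D, U+0061 and U+10000.
text = "\u004D\u0061\U00010000"

# Encode the whole string and show the bytes as uppercase hex digits.
encoded = text.encode("utf-8")
hex_string = encoded.hex().upper()
print(hex_string)  # prints 4D61F0908080

# Show how many bytes each character contributes.
lengths = [len(ch.encode("utf-8")) for ch in text]
print(lengths)  # prints [1, 1, 4]
```

The first two characters fall in the 1-byte range, and U+10000 is the first code point of the 4-byte range.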

Dr. Herong Yang, updated in 2009