Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
UTF-8 Encoding
This section provides a quick introduction to the UTF-8 (Unicode Transformation Format - 8-bit) encoding for the Unicode character set. It uses 1, 2, 3, or 4 bytes for each character.
UTF-8: A character encoding that maps each code point of the Unicode character set to a sequence of 1 to 4 bytes (8 bits each). UTF-8 stands for Unicode Transformation Format - 8-bit.
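Here is my understanding of the UTF-8 specification. When UTF-8 encoding is used to encode (serialize) Unicode characters into a byte stream for communication or storage, the following logic should be used:
- If a code point is in the range U+000000...U+00007F, treat it as a 7-bit binary number, 0bxxxxxxx, and convert it to 1 byte: 0b0xxxxxxx.
- If a code point is in the range U+000080...U+0007FF, treat it as an 11-bit binary number, 0byyyyyxxxxxx, and convert it to 2 bytes: 0b110yyyyy, 0b10xxxxxx.
- If a code point is in the range U+000800...U+00FFFF, treat it as a 16-bit binary number, 0bzzzzyyyyyyxxxxxx, and convert it to 3 bytes: 0b1110zzzz, 0b10yyyyyy, 0b10xxxxxx.
- If a code point is in the range U+010000...U+10FFFF, treat it as a 21-bit binary number, 0bvvvzzzzzzyyyyyyxxxxxx, and convert it to 4 bytes: 0b11110vvv, 0b10zzzzzz, 0b10yyyyyy, 0b10xxxxxx.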
The above logic can also be summarized in a table like this:
Code Point Range       Binary Format              Byte 1      Byte 2      Byte 3      Byte 4
U+000000...U+00007F    0bxxxxxxx                  0b0xxxxxxx
U+000080...U+0007FF    0byyyyyxxxxxx              0b110yyyyy  0b10xxxxxx
U+000800...U+00FFFF    0bzzzzyyyyyyxxxxxx         0b1110zzzz  0b10yyyyyy  0b10xxxxxx
U+010000...U+10FFFF    0bvvvzzzzzzyyyyyyxxxxxx    0b11110vvv  0b10zzzzzz  0b10yyyyyy  0b10xxxxxx
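The bit layout in the table can be sketched in Python as shown below. This is only an illustration under stated assumptions, not a reference implementation: the function name utf8_encode is hypothetical, and the rejection of surrogate code points (U+D800...U+DFFF) comes from the UTF-8 standard rather than from the table itself.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the table's bit layout.

    The name utf8_encode is hypothetical, chosen for this sketch.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside U+0000...U+10FFFF")
    # Assumption not stated in the table: the UTF-8 standard forbids
    # encoding surrogate code points.
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if code_point <= 0x7F:
        # 7 bits: 0b0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 11 bits: 0b110yyyyy, 0b10xxxxxx
        return bytes([
            0xC0 | (code_point >> 6),
            0x80 | (code_point & 0x3F),
        ])
    if code_point <= 0xFFFF:
        # 16 bits: 0b1110zzzz, 0b10yyyyyy, 0b10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    # 21 bits: 0b11110vvv, 0b10zzzzzz, 0b10yyyyyy, 0b10xxxxxx
    return bytes([
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])
```

Each branch masks 6 bits at a time with 0x3F and adds the 0b10 continuation prefix with 0x80. The first byte carries whatever high bits remain, together with the length prefix. The result can be compared against Python's built-in encoder, for example utf8_encode(0x20AC) == "\u20ac".encode("utf-8").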
For example, the 3 Unicode characters U+004D, U+0061 and U+10000 are encoded as the 6-byte sequence 0x4D61F0908080 when UTF-8 is used: 1 byte each for U+004D (0x4D) and U+0061 (0x61), and 4 bytes for U+10000 (0xF0908080).
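The example above can be checked with Python's built-in UTF-8 codec. This is a quick verification sketch, assuming Python 3.5 or later (for bytes.hex()); the variable names are chosen for illustration only.

```python
# The 3 characters from the example: U+004D, U+0061 and U+10000.
text = "\u004D\u0061\U00010000"

# Encode (serialize) the characters into a UTF-8 byte stream.
encoded = text.encode("utf-8")
print(encoded.hex().upper())  # 4D61F0908080

# Number of bytes used by each character.
byte_counts = [len(ch.encode("utf-8")) for ch in text]
print(byte_counts)  # [1, 1, 4]

# Decoding the byte stream gives back the original characters.
decoded = encoded.decode("utf-8")
print(decoded == text)  # True
```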