Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02
Unicode Transformation Formats (UTF)

This chapter helps you understand:

- UTF-8 Encoding
- UTF-16LE Encoding
- UTF-16BE Encoding

UTF-8 Encoding

UTF-8: A character encoding that maps each code point of the Unicode 3.1 character set to a sequence of 1 to 4 bytes. UTF-8 stands for Unicode Transformation Format - 8-bit.

The following table illustrates how UTF-8 encoding works:

```
Code Point              Encoding
Range                   Byte 1    Byte 2    Byte 3    Byte 4
U+000000 - U+00007F     0xxxxxxx
U+000080 - U+0007FF     110xxxxx  10xxxxxx
U+000800 - U+00FFFF     1110xxxx  10xxxxxx  10xxxxxx
U+010000 - U+10FFFF     11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
```

Here is an algorithm for UTF-8 encoding on a single character:

```
Input:
   unsigned integer c - the code point of the character to be encoded
Output:
   byte b1, b2, b3, b4 - the encoded sequence of bytes
Algorithm:
   if (c<0x80)
      b1 = c>>0  & 0x7F | 0x00
      b2 = null
      b3 = null
      b4 = null
   else if (c<0x0800)
      b1 = c>>6  & 0x1F | 0xC0
      b2 = c>>0  & 0x3F | 0x80
      b3 = null
      b4 = null
   else if (c<0x010000)
      b1 = c>>12 & 0x0F | 0xE0
      b2 = c>>6  & 0x3F | 0x80
      b3 = c>>0  & 0x3F | 0x80
      b4 = null
   else if (c<0x110000)
      b1 = c>>18 & 0x07 | 0xF0
      b2 = c>>12 & 0x3F | 0x80
      b3 = c>>6  & 0x3F | 0x80
      b4 = c>>0  & 0x3F | 0x80
   end if
```

Features of UTF-8 encoding:

- Very efficient for characters in Western languages.
- Compatible with the single byte ASCII encoding.
- Easy to process a UTF-8 encoded text file.

The code points for most of the characters in Western languages are in the range of U+000000 to U+00007F, which are encoded with a single byte. The rest of the characters in such text are mostly encoded with two bytes. So text in a Western language is encoded with about 1.1 bytes per character on average. This is a big saving compared with storing each code point directly, which always requires 3 bytes per character.

An ASCII encoded text file can be processed as a UTF-8 encoded text file without any changes.

Processing a UTF-8 encoded text file is relatively easy.
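
The encoding table and algorithm given earlier in this section can be sketched in Python as follows. This is only an illustrative sketch: the function name `utf8_encode` is made up for this example, and, like the algorithm it mirrors, it does not reject the surrogate range U+D800 - U+DFFF, which Python's built-in encoder does reject.

```python
def utf8_encode(c: int) -> bytes:
    """Encode one code point c into 1 to 4 bytes, following the UTF-8 table.

    Hypothetical helper for illustration; real code should use str.encode().
    """
    if c < 0 or c >= 0x110000:
        raise ValueError("code point out of range: %r" % c)
    if c < 0x80:
        # 0xxxxxxx
        return bytes([c])
    if c < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (c >> 6),
                      0x80 | (c & 0x3F)])
    if c < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (c >> 12),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (c >> 18),
                  0x80 | ((c >> 12) & 0x3F),
                  0x80 | ((c >> 6) & 0x3F),
                  0x80 | (c & 0x3F)])


if __name__ == "__main__":
    # One sample code point from each of the four ranges, checked against
    # Python's built-in UTF-8 encoder.
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        encoded = utf8_encode(cp)
        assert encoded == chr(cp).encode("utf-8")
        print("U+%06X ->" % cp, " ".join("%02X" % b for b in encoded))
```

Returning a `bytes` object of length 1 to 4 takes the place of the `null` values for unused bytes in the pseudocode.
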
As an example of how easy UTF-8 text is to process, suppose you are looking at one byte of an encoded character in the middle of the file and want to find the first byte of that character. You just need to follow this simple logic:

```
while (current byte matches the bit pattern '10xxxxxx') {
   current byte = previous byte
}
```

Exercise: Write an algorithm to decode a UTF-8 encoded byte sequence.

UTF-16LE Encoding

UTF-16LE: A character encoding that maps each code point of the Unicode 3.0 character set to a sequence of 2 bytes. UTF-16LE stands for Unicode Transformation Format - 16-bit Little Endian.

UTF-16LE maps a code point into 2 bytes by:

- First mapping the code point into a 16-bit binary integer representation.
- Creating the first byte with the low-order (least significant) 8 bits of the 16-bit representation.
- Creating the second byte with the high-order (most significant) 8 bits of the 16-bit representation.

The following table illustrates how UTF-16LE encoding works:

```
Code Point          Encoding Range
Range               Byte 1 Byte 2     Byte 1 Byte 2
U+0000 - U+00FF     0x00   0x00    -  0xFF   0x00
U+0100 - U+FFFF     0x00   0x01    -  0xFF   0xFF
```

Note that:

- A UTF-16LE encoded plain text file usually starts with the 2-byte byte order mark 0xFF 0xFE.
- In MS Word and Excel, UTF-16LE is called Unicode encoding.

UTF-16BE Encoding

UTF-16BE: A character encoding that maps each code point of the Unicode 3.0 character set to a sequence of 2 bytes. UTF-16BE stands for Unicode Transformation Format - 16-bit Big Endian.

UTF-16BE maps a code point into 2 bytes by:

- First mapping the code point into a 16-bit binary integer representation.
- Creating the first byte with the high-order (most significant) 8 bits of the 16-bit representation.
- Creating the second byte with the low-order (least significant) 8 bits of the 16-bit representation.

The following table illustrates how UTF-16BE encoding works:

```
Code Point          Encoding Range
Range               Byte 1 Byte 2     Byte 1 Byte 2
U+0000 - U+00FF     0x00   0x00    -  0x00   0xFF
U+0100 - U+FFFF     0x01   0x00    -  0xFF   0xFF
```

Note that:

- UTF-16BE is also called byte-reversed UTF-16.
- A UTF-16BE encoded plain text file usually starts with the 2-byte byte order mark 0xFE 0xFF.
- MS Word and Excel don't support this encoding.
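
The two byte orders covered in this chapter can be sketched in Python as follows. The sketch assumes the usual meaning of the terms: little endian stores the low-order byte first, and big endian stores the high-order byte first. The function name `utf16_encode` is hypothetical, and only the 2-byte range U+0000 - U+FFFF discussed here is handled; code points above U+FFFF need surrogate pairs, which this sketch does not cover.

```python
def utf16_encode(c: int, little_endian: bool) -> bytes:
    """Encode one code point in U+0000 - U+FFFF as 2 bytes.

    Hypothetical helper for illustration; real code should use
    str.encode("utf-16-le") or str.encode("utf-16-be").
    """
    if not 0 <= c <= 0xFFFF:
        raise ValueError("only U+0000 - U+FFFF is covered by this sketch")
    high = (c >> 8) & 0xFF  # most significant 8 bits
    low = c & 0xFF          # least significant 8 bits
    if little_endian:
        return bytes([low, high])
    return bytes([high, low])


if __name__ == "__main__":
    # Check a few code points against Python's built-in codecs.
    for cp in (0x0041, 0x00FF, 0x20AC, 0xFEFF):
        le = utf16_encode(cp, little_endian=True)
        be = utf16_encode(cp, little_endian=False)
        assert le == chr(cp).encode("utf-16-le")
        assert be == chr(cp).encode("utf-16-be")
        print("U+%04X  LE: %s  BE: %s" % (cp, le.hex(), be.hex()))
```

Encoding the byte order mark U+FEFF this way gives the bytes 0xFF 0xFE in little-endian order and 0xFE 0xFF in big-endian order, which is why the first two bytes of a file can reveal its byte order.
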
Dr. Herong Yang, updated in 2007