Herong's Tutorial Notes on Unicode - Unicode Transformation Formats (UTF)

Herong's Tutorial Notes on Unicode

Dr. Herong Yang, Version 4.02

Unicode Transformation Formats (UTF)

This chapter helps you understand:

UTF-8 Encoding
UTF-16LE Encoding
UTF-16BE Encoding

UTF-8 Encoding

UTF-8: A character encoding that maps each code point of the Unicode 3.1 character set to a variable-length sequence of 1 to 4 bytes. UTF-8 stands for Unicode Transformation Format - 8-bit.

The following table illustrates how UTF-8 encoding works:

Code Point            Encoding
Range                 Byte 1   Byte 2   Byte 3   Byte 4

U+000000 - U+00007F   0xxxxxxx
U+000080 - U+0007FF   110xxxxx 10xxxxxx
U+000800 - U+00FFFF   1110xxxx 10xxxxxx 10xxxxxx
U+010000 - U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Here is an algorithm for UTF-8 encoding on a single character:

Input: 
   unsigned integer c - the code point of the character to be encoded
Output: 
   byte b1, b2, b3, b4 - the encoded sequence of bytes
Algorithm:
   if (c<0x80) 
      b1 = c>>0  & 0x7F | 0x00
      b2 = null
      b3 = null
      b4 = null
   else if (c<0x0800)
      b1 = c>>6  & 0x1F | 0xC0
      b2 = c>>0  & 0x3F | 0x80
      b3 = null
      b4 = null
   else if (c<0x010000)
      b1 = c>>12 & 0x0F | 0xE0
      b2 = c>>6  & 0x3F | 0x80
      b3 = c>>0  & 0x3F | 0x80
      b4 = null
   else if (c<0x110000)
      b1 = c>>18 & 0x07 | 0xF0
      b2 = c>>12 & 0x3F | 0x80
      b3 = c>>6  & 0x3F | 0x80
      b4 = c>>0  & 0x3F | 0x80
   end if
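
The algorithm can be sketched in Python as follows. This is a minimal illustration, not a replacement for the built-in codec. The function name utf8_encode is hypothetical, and the surrogate check is an assumption added here that the algorithm above does not include.

```python
def utf8_encode(c: int) -> bytes:
    """Encode one code point as UTF-8, one branch per code point range."""
    if c < 0:
        raise ValueError("code point must not be negative")
    if c < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([c])
    if c < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    if c < 0x10000:
        # Assumption: reject surrogates (U+D800 - U+DFFF), which are not characters.
        if 0xD800 <= c <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (c >> 12),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    if c < 0x110000:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (c >> 18),
                      0x80 | ((c >> 12) & 0x3F),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    raise ValueError("code point is beyond U+10FFFF")


# Compare against Python's own UTF-8 codec, one sample per range.
for ch in "A\u00e9\u20ac\U0001F600":
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```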

Features of UTF-8 encoding:

Very efficient for characters in Western languages.
Compatible with the single byte ASCII encoding.
Easy to process a UTF-8 encoded text file.

The code points for most of the characters in Western languages are in the range of U+000000 to U+00007F, which are encoded with a single byte. Most of the remaining characters are encoded with two bytes. So a text in a Western language is encoded with about 1.1 bytes per character on average. This is a big saving compared with storing each code point directly, which would always require 3 bytes per character.
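
The average depends on the text itself. A rough way to measure it for any sample is sketched below. The function name average_bytes_per_char and the sample sentence are illustrative assumptions.

```python
def average_bytes_per_char(text: str) -> float:
    """Return the number of UTF-8 bytes divided by the number of characters."""
    if not text:
        raise ValueError("text must not be empty")
    return len(text.encode("utf-8")) / len(text)


sample = "Caf\u00e9 au lait, s'il vous pla\u00eet"  # mostly ASCII, two accented letters
print(f"{average_bytes_per_char(sample):.2f} bytes per character")
```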

An ASCII encoded text file can be processed as a UTF-8 encoded text file without any changes.
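
This compatibility is easy to check in Python: bytes produced by the ASCII codec decode unchanged with the UTF-8 codec. The variable names below are illustrative.

```python
message = "Hello, world!"
ascii_bytes = message.encode("ascii")

# The same bytes are valid UTF-8 and decode to the same text.
assert ascii_bytes.decode("utf-8") == message
# Encoding pure ASCII text as UTF-8 yields byte-for-byte identical output.
assert message.encode("utf-8") == ascii_bytes
```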

Processing a UTF-8 encoded text file is relatively easy. For example, if you are looking at one byte of an encoded character in the middle of the file, and want to find the first byte of this encoded character, you just need to follow this simple logic:

   while (current byte matches the bit pattern '10xxxxxx') {
      Current byte = previous byte
   }
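
The same backward scan can be sketched in Python. The function name char_start is hypothetical, and the i > 0 guard is an assumption added so the scan stops at the start of the data.

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character that contains data[i]."""
    # Continuation bytes match the bit pattern 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "a\u20acb".encode("utf-8")  # b'a\xe2\x82\xacb'
print(char_start(data, 3))         # prints 1, the index of the lead byte 0xE2
```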

Exercise: Write an algorithm to decode a UTF-8 encoded byte sequence.

UTF-16LE Encoding

UTF-16LE: A character encoding that maps each code point of the Unicode 3.0 character set to a sequence of 2 bytes. UTF-16LE stands for Unicode Transformation Format - 16-bit Little Endian.

UTF-16LE maps a code point into 2 bytes by:

First mapping the code point into a 16-bit binary integer representation.
Creating the first byte with the low-order (least significant) 8 bits of the 16-bit representation.
Creating the second byte with the high-order (most significant) 8 bits of the 16-bit representation.

The following table illustrates how UTF-16LE encoding works:

Code Point        Encoding Range
Range             Byte 1   Byte 2     Byte 1   Byte 2

U+0000 - U+00FF   0x00     0x00    -  0xFF     0x00
U+0100 - U+FFFF   0x00     0x01    -  0xFF     0xFF
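
A minimal Python sketch of UTF-16LE for a single code point in the U+0000 to U+FFFF range follows. Little endian means the low-order 8 bits are stored first, then the high-order 8 bits. The function name utf16le_encode is hypothetical, and the sketch does not check for surrogate code points.

```python
def utf16le_encode(c: int) -> bytes:
    """Encode one code point (U+0000 - U+FFFF) as 2 bytes, low-order byte first."""
    if not 0 <= c <= 0xFFFF:
        raise ValueError("this sketch only covers U+0000 to U+FFFF")
    return bytes([c & 0xFF, c >> 8])


# U+20AC (euro sign) becomes 0xAC 0x20, matching Python's own codec.
assert utf16le_encode(0x20AC) == "\u20ac".encode("utf-16-le")
```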

Note that:

A UTF-16LE encoded plain text file usually starts with the byte order mark U+FEFF, stored as the bytes 0xFF 0xFE.
In MS Word and Excel, UTF-16LE is called Unicode encoding.

UTF-16BE Encoding

UTF-16BE: A character encoding that maps each code point of the Unicode 3.0 character set to a sequence of 2 bytes. UTF-16BE stands for Unicode Transformation Format - 16-bit Big Endian.

UTF-16BE maps a code point into 2 bytes by:

First mapping the code point into a 16-bit binary integer representation.
Creating the first byte with the high-order (most significant) 8 bits of the 16-bit representation.
Creating the second byte with the low-order (least significant) 8 bits of the 16-bit representation.

The following table illustrates how UTF-16BE encoding works:

Code Point        Encoding Range
Range             Byte 1   Byte 2     Byte 1   Byte 2

U+0000 - U+00FF   0x00     0x00    -  0x00     0xFF
U+0100 - U+FFFF   0x01     0x00    -  0xFF     0xFF
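
A minimal Python sketch of UTF-16BE for a single code point in the U+0000 to U+FFFF range follows. Big endian means the high-order 8 bits are stored first, then the low-order 8 bits. The function name utf16be_encode is hypothetical, and the sketch does not check for surrogate code points.

```python
def utf16be_encode(c: int) -> bytes:
    """Encode one code point (U+0000 - U+FFFF) as 2 bytes, high-order byte first."""
    if not 0 <= c <= 0xFFFF:
        raise ValueError("this sketch only covers U+0000 to U+FFFF")
    return bytes([c >> 8, c & 0xFF])


# U+20AC (euro sign) becomes 0x20 0xAC, matching Python's own codec.
assert utf16be_encode(0x20AC) == "\u20ac".encode("utf-16-be")
```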

Note that:

UTF-16BE is also called byte-reversed UTF-16.
A UTF-16BE encoded plain text file usually starts with the byte order mark U+FEFF, stored as the bytes 0xFE 0xFF.
MS Word and Excel don't support this encoding.

Dr. Herong Yang, updated in 2007
