Unicode Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 5.00

UTF-8 Encoding Algorithm

This section provides a tutorial example of an algorithm that encodes a Unicode code point as a UTF-8 byte sequence.

Here is an algorithm that encodes a single character in UTF-8:

Input: 
   unsigned integer c - the code point of the character to be encoded
Output: 
   byte b1, b2, b3, b4 - the encoded sequence of bytes
Algorithm:
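   if (c<0x80)
      b1 = ((c>>0)  & 0x7F) | 0x00
      b2 = null
      b3 = null
      b4 = null
   else if (c<0x0800)
      b1 = ((c>>6)  & 0x1F) | 0xC0
      b2 = ((c>>0)  & 0x3F) | 0x80
      b3 = null
      b4 = null
   else if (c>=0xD800 && c<=0xDFFF)
      error: surrogate code points cannot be encoded in UTF-8
   else if (c<0x010000)
      b1 = ((c>>12) & 0x0F) | 0xE0
      b2 = ((c>>6)  & 0x3F) | 0x80
      b3 = ((c>>0)  & 0x3F) | 0x80
      b4 = null
   else if (c<0x110000)
      b1 = ((c>>18) & 0x07) | 0xF0
      b2 = ((c>>12) & 0x3F) | 0x80
      b3 = ((c>>6)  & 0x3F) | 0x80
      b4 = ((c>>0)  & 0x3F) | 0x80
   else
      error: c is beyond the Unicode range (maximum is 0x10FFFF)
   end if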
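
The algorithm above can be sketched in Python as follows. This is only an illustration, not part of the original notes: the function name utf8_encode is hypothetical, and it returns a bytes object of 1 to 4 bytes instead of four separate variables. The checks for surrogates and for values above 0x10FFFF are included on the assumption that such values should be rejected, since they have no valid UTF-8 form.

```python
def utf8_encode(c: int) -> bytes:
    """Encode one Unicode code point as a UTF-8 byte sequence (1 to 4 bytes)."""
    if c < 0:
        raise ValueError("code point must not be negative")
    if c < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([c & 0x7F])
    if c < 0x0800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([((c >> 6) & 0x1F) | 0xC0,
                      (c & 0x3F) | 0x80])
    if 0xD800 <= c <= 0xDFFF:
        # Assumption of this sketch: surrogates are rejected, not encoded.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if c < 0x010000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([((c >> 12) & 0x0F) | 0xE0,
                      ((c >> 6) & 0x3F) | 0x80,
                      (c & 0x3F) | 0x80])
    if c < 0x110000:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([((c >> 18) & 0x07) | 0xF0,
                      ((c >> 12) & 0x3F) | 0x80,
                      ((c >> 6) & 0x3F) | 0x80,
                      (c & 0x3F) | 0x80])
    raise ValueError("code point is beyond the Unicode range (max 0x10FFFF)")


if __name__ == "__main__":
    # U+20AC (euro sign) needs three bytes.
    print(utf8_encode(0x20AC).hex())  # e282ac
```

A quick way to check such a sketch is to compare its output with Python's built-in encoder, for example utf8_encode(0x20AC) == chr(0x20AC).encode("utf-8").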

Exercise: Write an algorithm to decode a UTF-8 encoded byte sequence.
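
One possible answer to the exercise is sketched below in Python. It is a minimal illustration under stated assumptions, not the author's solution: the name utf8_decode_one is hypothetical, it decodes a single code point starting at a given index, and it rejects truncated sequences, bad continuation bytes, overlong forms, surrogates, and values above 0x10FFFF.

```python
from typing import Tuple


def utf8_decode_one(data: bytes, i: int = 0) -> Tuple[int, int]:
    """Decode one code point starting at data[i].

    Returns (code_point, index_of_next_byte).
    """
    b1 = data[i]
    if b1 < 0x80:
        # 1 byte: 0xxxxxxx
        return b1, i + 1
    if 0xC0 <= b1 < 0xE0:
        # 2 bytes: 110xxxxx, smallest legal value is 0x80
        n, c, lowest = 1, b1 & 0x1F, 0x80
    elif 0xE0 <= b1 < 0xF0:
        # 3 bytes: 1110xxxx, smallest legal value is 0x800
        n, c, lowest = 2, b1 & 0x0F, 0x0800
    elif 0xF0 <= b1 < 0xF8:
        # 4 bytes: 11110xxx, smallest legal value is 0x10000
        n, c, lowest = 3, b1 & 0x07, 0x010000
    else:
        raise ValueError("invalid lead byte")
    if i + n >= len(data):
        raise ValueError("truncated sequence")
    for b in data[i + 1:i + 1 + n]:
        # Every continuation byte must look like 10xxxxxx.
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        c = (c << 6) | (b & 0x3F)
    if c < lowest:
        raise ValueError("overlong encoding")
    if 0xD800 <= c <= 0xDFFF or c > 0x10FFFF:
        raise ValueError("not a valid Unicode scalar value")
    return c, i + 1 + n


if __name__ == "__main__":
    cp, nxt = utf8_decode_one(b"\xe2\x82\xac")
    print(hex(cp), nxt)  # 0x20ac 3
```

Returning the index of the next byte lets a caller decode a longer buffer by calling the function in a loop until the index reaches the end of the data.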

Dr. Herong Yang, updated in 2009