UTF-8 Encoding Algorithm
<< UTF-8 (Unicode Transformation Format - 8-Bit)
<< Unicode Tutorials - Herong's Tutorial Notes
This section provides a tutorial example on how to write a programming algorithm to encode characters with UTF-8 encoding.
Here is an algorithm for UTF-8 encoding on a single character:
Input: unsigned integer c - the code point of the character to be encoded Output: byte b1, b2, b3, b4 - the encoded sequence of bytes Algorithm: if (c<0x80) b1 = c>>0 & 0x7F | 0x00 b2 = null b3 = null b4 = null else if (c<0x0800) b1 = c>>6 & 0x1F | 0xC0 b2 = c>>0 & 0x3F | 0x80 b3 = null b4 = null else if (c<0x010000) b1 = c>>12 & 0x0F | 0xE0 b2 = c>>6 & 0x3F | 0x80 b3 = c>>0 & 0x3F | 0x80 b4 = null else if (c<0x110000) b1 = c>>18 & 0x07 | 0xF0 b2 = c>>12 & 0x3F | 0x80 b3 = c>>6 & 0x3F | 0x80 b4 = c>>0 & 0x3F | 0x80 end if
Exercise: Write an algorithm to decode a UTF-8 encoded byte sequence.
Sections in This Chapter
UTF-8 Encoding
Features of UTF-8 Encoding