Unicode Transformation Formats (UTF)
This chapter helps you understand:
- UTF-8 Encoding
- UTF-16LE Encoding
- UTF-16BE Encoding
UTF-8 Encoding
UTF-8: A character encoding that maps each code point of the Unicode 3.1
character set to a sequence of 1 to 4 bytes. UTF-8 stands for Unicode
Transformation Format - 8-bit.
The following table illustrates how UTF-8 encoding works:
Code Point Range       Byte 1    Byte 2    Byte 3    Byte 4
U+000000 - U+00007F    0xxxxxxx
U+000080 - U+0007FF    110xxxxx  10xxxxxx
U+000800 - U+00FFFF    1110xxxx  10xxxxxx  10xxxxxx
U+010000 - U+10FFFF    11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
Here is an algorithm for UTF-8 encoding on a single character:
Input:
    unsigned integer c - the code point of the character to be encoded
Output:
    byte b1, b2, b3, b4 - the encoded sequence of bytes
Algorithm:
    if (c<0x80)
        b1 = c>>0 & 0x7F | 0x00
        b2 = null
        b3 = null
        b4 = null
    else if (c<0x0800)
        b1 = c>>6 & 0x1F | 0xC0
        b2 = c>>0 & 0x3F | 0x80
        b3 = null
        b4 = null
    else if (c<0x010000)
        b1 = c>>12 & 0x0F | 0xE0
        b2 = c>>6 & 0x3F | 0x80
        b3 = c>>0 & 0x3F | 0x80
        b4 = null
    else if (c<0x110000)
        b1 = c>>18 & 0x07 | 0xF0
        b2 = c>>12 & 0x3F | 0x80
        b3 = c>>6 & 0x3F | 0x80
        b4 = c>>0 & 0x3F | 0x80
    end if
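The algorithm above can be sketched in Python as follows. This is only an
illustration, not a replacement for a library encoder: the function name
utf8_encode is hypothetical, and rejecting the surrogate range U+D800 to
U+DFFF is an added assumption that mirrors what Python's own codec does.

```python
def utf8_encode(c: int) -> bytes:
    """Encode one code point with the four-branch algorithm (sketch)."""
    if c < 0:
        raise ValueError("code point must be non-negative")
    if c < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([c])
    if c < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (c >> 6), 0x80 | (c & 0x3F)])
    if c < 0x10000:
        # Assumption: surrogates are not valid characters, so reject them.
        if 0xD800 <= c <= 0xDFFF:
            raise ValueError("surrogate code points cannot be encoded")
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (c >> 12),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    if c < 0x110000:
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (c >> 18),
                      0x80 | ((c >> 12) & 0x3F),
                      0x80 | ((c >> 6) & 0x3F),
                      0x80 | (c & 0x3F)])
    raise ValueError("code point is above U+10FFFF")


# Cross-check against Python's built-in encoder for one value per branch.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

For example, utf8_encode(0x20AC) returns the three bytes 0xE2 0x82 0xAC.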
Features of UTF-8 encoding:
- Very efficient for characters in Western languages.
- Compatible with the single byte ASCII encoding.
- Easy to process a UTF-8 encoded text file.
The code points for most of the characters in Western languages are in the
range of U+000000 to U+00007F, which are encoded in a single byte.
Most of the remaining characters are encoded in two bytes. So text in a
Western language is encoded with about 1.1 bytes per character on average.
This is a big saving compared with a fixed-width encoding that stores each
code point directly, which always requires 3 bytes per character.
An ASCII encoded text file can be processed as a UTF-8 encoded text file without
any changes.
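Both claims can be checked with a short Python sketch. The helper name
bytes_per_char and the sample string are hypothetical, chosen only for
illustration; the average for real text depends on the language.

```python
# ASCII compatibility: bytes produced by the ASCII encoder decode
# unchanged when treated as UTF-8.
ascii_bytes = "Hello, world!".encode("ascii")
assert ascii_bytes.decode("utf-8") == "Hello, world!"


def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per character in text."""
    return len(text.encode("utf-8")) / len(text)


# Hypothetical sample: 12 characters, one of which (U+00E9) needs 2 bytes.
sample = "Caf\u00e9 au lait"
ratio = bytes_per_char(sample)  # 13 bytes / 12 characters
```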
Processing a UTF-8 encoded text file is relatively easy. For example, if you
are looking at one byte in the middle of the file and want to find the first
byte of the encoded character it belongs to, you just need to follow this
simple logic:
    while (current byte matches the bit pattern '10xxxxxx') {
        current byte = previous byte
    }
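A minimal Python sketch of this backward scan, assuming the data is
well-formed UTF-8. The function name char_start is hypothetical. A byte
matches '10xxxxxx' exactly when its top two bits are 10, which is what the
mask test checks.

```python
def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i]."""
    # Continuation bytes look like 10xxxxxx, i.e. (byte & 0xC0) == 0x80.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


# 'a' is 1 byte, the euro sign U+20AC is 3 bytes, 'b' is 1 byte.
data = "a\u20acb".encode("utf-8")
assert char_start(data, 3) == 1  # last byte of the euro sign
assert char_start(data, 4) == 4  # 'b' is its own first byte
```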
Exercise: Write an algorithm to decode a UTF-8 encoded byte sequence.
UTF-16LE Encoding
UTF-16LE: A character encoding that maps code points of the Unicode 3.0
character set to a sequence of 2 bytes. UTF-16LE stands for Unicode
Transformation Format - 16-bit Little Endian. Code points above U+FFFF take
4 bytes (a surrogate pair) and are not covered in this section.
UTF-16LE maps a code point into 2 bytes by:
- First mapping the code point into a 16-bit binary integer representation.
- Creating the first byte with the low-order (least significant) 8 bits of
  the 16-bit representation.
- Creating the second byte with the high-order (most significant) 8 bits of
  the 16-bit representation.
The following table illustrates how UTF-16LE encoding works:
Code Point Range     First Encoding      Last Encoding
                     Byte 1  Byte 2      Byte 1  Byte 2
U+0000 - U+00FF      0x00    0x00    -   0xFF    0x00
U+0100 - U+FFFF      0x00    0x01    -   0xFF    0xFF
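A minimal Python sketch of the little-endian byte order, in which the
low-order byte of the 16-bit value is written first. The function name
utf16le_encode is hypothetical, and the sketch assumes the code point is in
the range U+0000 to U+FFFF and is not a surrogate.

```python
def utf16le_encode(c: int) -> bytes:
    """Encode one code point in U+0000..U+FFFF as 2 bytes, low byte first."""
    if not 0 <= c <= 0xFFFF:
        raise ValueError("this sketch only handles U+0000 to U+FFFF")
    low = c & 0xFF   # least significant 8 bits
    high = c >> 8    # most significant 8 bits
    return bytes([low, high])


# Cross-check against Python's built-in "utf-16-le" codec.
for cp in (0x0041, 0x00FF, 0x0100, 0x20AC):
    assert utf16le_encode(cp) == chr(cp).encode("utf-16-le")
```

For example, utf16le_encode(0x20AC) returns the bytes 0xAC 0x20.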
Note that:
- A UTF-16LE encoded plain text file usually starts with the two bytes
  0xFF 0xFE, which is the byte order mark U+FEFF in little-endian order.
- In MS Word and Excel, UTF-16LE is called Unicode encoding.
UTF-16BE Encoding
UTF-16BE: A character encoding that maps code points of the Unicode 3.0
character set to a sequence of 2 bytes. UTF-16BE stands for Unicode
Transformation Format - 16-bit Big Endian. Code points above U+FFFF take
4 bytes (a surrogate pair) and are not covered in this section.
UTF-16BE maps a code point into 2 bytes by:
- First mapping the code point into a 16-bit binary integer representation.
- Creating the first byte with the high-order (most significant) 8 bits of
  the 16-bit representation.
- Creating the second byte with the low-order (least significant) 8 bits of
  the 16-bit representation.
The following table illustrates how UTF-16BE encoding works:
Code Point Range     First Encoding      Last Encoding
                     Byte 1  Byte 2      Byte 1  Byte 2
U+0000 - U+00FF      0x00    0x00    -   0x00    0xFF
U+0100 - U+FFFF      0x01    0x00    -   0xFF    0xFF
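A minimal Python sketch of the big-endian byte order, in which the
high-order byte of the 16-bit value is written first. The function name
utf16be_encode is hypothetical, and the sketch assumes the code point is in
the range U+0000 to U+FFFF and is not a surrogate.

```python
def utf16be_encode(c: int) -> bytes:
    """Encode one code point in U+0000..U+FFFF as 2 bytes, high byte first."""
    if not 0 <= c <= 0xFFFF:
        raise ValueError("this sketch only handles U+0000 to U+FFFF")
    # int.to_bytes with "big" puts the most significant byte first.
    return c.to_bytes(2, "big")


# Cross-check against Python's built-in "utf-16-be" codec.
for cp in (0x0041, 0x00FF, 0x0100, 0x20AC):
    assert utf16be_encode(cp) == chr(cp).encode("utf-16-be")
```

For example, utf16be_encode(0x20AC) returns the bytes 0x20 0xAC.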
Note that:
- UTF-16BE is also called byte-reversed UTF-16.
- A UTF-16BE encoded plain text file usually starts with the two bytes
  0xFE 0xFF, which is the byte order mark U+FEFF in big-endian order.
- MS Word and Excel don't support this encoding.
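The byte order mark can be used to tell the two UTF-16 forms apart. The
Python sketch below is one possible approach, not a complete encoding
detector: the function name sniff_utf16 is hypothetical, and it only looks
at the first two bytes, so it does not rule out other encodings whose files
happen to start with the same bytes.

```python
import codecs
from typing import Optional


def sniff_utf16(data: bytes) -> Optional[str]:
    """Guess the UTF-16 byte order from a leading byte order mark, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes 0xFF 0xFE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes 0xFE 0xFF
        return "utf-16-be"
    return None  # no byte order mark found


# U+FEFF is the byte order mark; its encoded bytes reveal the byte order.
le_file = "\ufeffHi".encode("utf-16-le")
be_file = "\ufeffHi".encode("utf-16-be")
assert sniff_utf16(le_file) == "utf-16-le"
assert sniff_utf16(be_file) == "utf-16-be"
```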