Unicode Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 5.00

Character Set Encoding Maps - Unicode UTF-8

This section provides a tutorial example of analyzing and printing the character set encoding map for UTF-8, the most popular encoding for the Unicode character set.

Here is the output of my sample program, EncodingAnalyzer.java, for UTF-8 encoding:

Code Point > Byte Sequence - Code Point > Byte Sequence

0000 > 00 - 007F > 7F
0080 > C2 80 - 00BF > C2 BF
00C0 > C3 80 - 00FF > C3 BF
0100 > C4 80 - 013F > C4 BF
......
07C0 > DF 80 - 07FF > DF BF
0800 > E0 A0 80 - 083F > E0 A0 BF
0840 > E0 A1 80 - 087F > E0 A1 BF
0880 > E0 A2 80 - 08BF > E0 A2 BF
......
0FC0 > E0 BF 80 - 0FFF > E0 BF BF
1000 > E1 80 80 - 103F > E1 80 BF
1040 > E1 81 80 - 107F > E1 81 BF
1080 > E1 82 80 - 10BF > E1 82 BF
......
1FC0 > E1 BF 80 - 1FFF > E1 BF BF
2000 > E2 80 80 - 203F > E2 80 BF
2040 > E2 81 80 - 207F > E2 81 BF
2080 > E2 82 80 - 20BF > E2 82 BF
......
2FC0 > E2 BF 80 - 2FFF > E2 BF BF
3000 > E3 80 80 - 303F > E3 80 BF
3040 > E3 81 80 - 307F > E3 81 BF
3080 > E3 82 80 - 30BF > E3 82 BF
......
3FC0 > E3 BF 80 - 3FFF > E3 BF BF
4000 > E4 80 80 - 403F > E4 80 BF
4040 > E4 81 80 - 407F > E4 81 BF
4080 > E4 82 80 - 40BF > E4 82 BF
......
4FC0 > E4 BF 80 - 4FFF > E4 BF BF
5000 > E5 80 80 - 503F > E5 80 BF
5040 > E5 81 80 - 507F > E5 81 BF
5080 > E5 82 80 - 50BF > E5 82 BF
......
5FC0 > E5 BF 80 - 5FFF > E5 BF BF
6000 > E6 80 80 - 603F > E6 80 BF
6040 > E6 81 80 - 607F > E6 81 BF
6080 > E6 82 80 - 60BF > E6 82 BF
......
6FC0 > E6 BF 80 - 6FFF > E6 BF BF
7000 > E7 80 80 - 703F > E7 80 BF
7040 > E7 81 80 - 707F > E7 81 BF
7080 > E7 82 80 - 70BF > E7 82 BF
......
7FC0 > E7 BF 80 - 7FFF > E7 BF BF
8000 > E8 80 80 - 803F > E8 80 BF
8040 > E8 81 80 - 807F > E8 81 BF
8080 > E8 82 80 - 80BF > E8 82 BF
......
8FC0 > E8 BF 80 - 8FFF > E8 BF BF
9000 > E9 80 80 - 903F > E9 80 BF
9040 > E9 81 80 - 907F > E9 81 BF
9080 > E9 82 80 - 90BF > E9 82 BF
......
9FC0 > E9 BF 80 - 9FFF > E9 BF BF
A000 > EA 80 80 - A03F > EA 80 BF
A040 > EA 81 80 - A07F > EA 81 BF
A080 > EA 82 80 - A0BF > EA 82 BF
......
AFC0 > EA BF 80 - AFFF > EA BF BF
B000 > EB 80 80 - B03F > EB 80 BF
B040 > EB 81 80 - B07F > EB 81 BF
B080 > EB 82 80 - B0BF > EB 82 BF
......
BFC0 > EB BF 80 - BFFF > EB BF BF
C000 > EC 80 80 - C03F > EC 80 BF
C040 > EC 81 80 - C07F > EC 81 BF
C080 > EC 82 80 - C0BF > EC 82 BF
......
CFC0 > EC BF 80 - CFFF > EC BF BF
D000 > ED 80 80 - D03F > ED 80 BF
D040 > ED 81 80 - D07F > ED 81 BF
D080 > ED 82 80 - D0BF > ED 82 BF
......
D7C0 > ED 9F 80 - D7FF > ED 9F BF
D800 > 3F - DFFF > 3F
E000 > EE 80 80 - E03F > EE 80 BF
E040 > EE 81 80 - E07F > EE 81 BF
E080 > EE 82 80 - E0BF > EE 82 BF
......
EFC0 > EE BF 80 - EFFF > EE BF BF
F000 > EF 80 80 - F03F > EF 80 BF
F040 > EF 81 80 - F07F > EF 81 BF
F080 > EF 82 80 - F0BF > EF 82 BF
......
FFC0 > EF BF 80 - FFFF > EF BF BF
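
To make the listing above easier to reproduce, here is a minimal sketch of how a single row of such a map could be generated. It is not the EncodingAnalyzer.java program itself; the class name Utf8MapSketch and the helper methods encodeHex() and row() are hypothetical names chosen for this illustration. It assumes the standard String.getBytes(Charset) method, which substitutes a replacement byte for input it cannot encode.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not the author's EncodingAnalyzer.java.
class Utf8MapSketch {

    // Encode one 16-bit code unit with the JDK's UTF-8 encoder and
    // return the resulting bytes as space-separated uppercase hex.
    static String encodeHex(int codePoint) {
        String s = String.valueOf((char) codePoint);
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Format one row in the same layout as the listing:
    // first code point > bytes - last code point > bytes
    static String row(int first, int last) {
        return String.format("%04X > %s - %04X > %s",
                first, encodeHex(first), last, encodeHex(last));
    }

    public static void main(String[] args) {
        System.out.println(row(0x0000, 0x007F)); // 1-byte range
        System.out.println(row(0x0080, 0x00BF)); // 2-byte range
        System.out.println(row(0x0800, 0x083F)); // 3-byte range
        System.out.println(row(0xD800, 0xDFFF)); // surrogates: replaced by 3F
    }
}
```

Running main() should print rows that match the corresponding lines of the listing, including "D800 > 3F - DFFF > 3F" for the surrogate range.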

The encoding map of UTF-8, the most popular encoding for the Unicode character set, is complex:

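  • The output sequence has a variable number of bytes: 1 byte for 0x0000 - 0x007F, 2 bytes for 0x0080 - 0x07FF, and 3 bytes for 0x0800 - 0xFFFF.
  • It is backward compatible with US-ASCII: code points 0x0000 - 0x007F are encoded as the same single byte values.
  • This map is only valid for Unicode 3.0 and older versions, because it stops at code point 0xFFFF. So it is a partial UTF-8 map. Code points above 0xFFFF, which UTF-8 encodes as 4 bytes, are not shown.
  • One section of code points is not valid: 0xD800 - 0xDFFF. This invalid section is called the surrogate area. The byte 3F shown for it is the ASCII question mark, which the encoder substitutes for input it cannot encode.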
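
The byte patterns in the map follow from the standard UTF-8 bit layout. The sketch below encodes a code point by hand and compares the result with the JDK's own encoder. The class name Utf8BitLayoutSketch and the method encode() are hypothetical names used only for this illustration. Unlike the map above, it also covers code points beyond 0xFFFF, and it rejects the surrogate area instead of emitting 3F; both are choices made for this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical class for illustration; not part of the original sample program.
class Utf8BitLayoutSketch {

    // Encode one code point by hand using the UTF-8 bit layout:
    //   0x0000  - 0x007F   : 0xxxxxxx
    //   0x0080  - 0x07FF   : 110xxxxx 10xxxxxx
    //   0x0800  - 0xFFFF   : 1110xxxx 10xxxxxx 10xxxxxx
    //   0x10000 - 0x10FFFF : 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            // Surrogates and out-of-range values have no UTF-8 form.
            throw new IllegalArgumentException(
                    String.format("Not encodable: %X", cp));
        }
        if (cp <= 0x7F) {
            return new byte[] { (byte) cp };
        }
        if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        int[] samples = { 0x0041, 0x00FF, 0x0800, 0xD7FF, 0xFFFF, 0x1F600 };
        for (int cp : samples) {
            byte[] manual = encode(cp);
            // Character.toChars() yields a surrogate pair for cp > 0xFFFF.
            byte[] jdk = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            System.out.printf("%04X: %d byte(s), matches JDK: %b%n",
                    cp, manual.length, Arrays.equals(manual, jdk));
        }
    }
}
```

For example, encode(0x0800) yields the bytes E0 A0 80 and encode(0xD7FF) yields ED 9F BF, the same values as in the map, while encode(0x1F600) yields 4 bytes, which the map does not cover.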

Last update: 2006.


Dr. Herong Yang, updated in 2009