Unicode Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 5.00

Examples of US-ASCII, UTF-8, UTF-16 and UTF-16BE Encodings

This section provides examples of encoded byte sequences of US-ASCII, UTF-8, UTF-16 and UTF-16BE encodings.

Let's continue to play with the testing program, EncodingSampler.java, provided in the previous section. This time, I want to try US-ASCII encoding:

US-ASCII encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, 3F, 3F, 3F, 00
00C0, 3F, 3F, 3F, 00
00FF, 3F, 3F, 3F, 00
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

It's obvious that US-ASCII works on a character set in the 0x0000 - 0x007F range.

Now I wan to try UTF-8 encoding to cover characters with code points higher than 0x007F:

UTF-8 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, C2 80, C2 80, C2 80, C2 80
00BF, C2 BF, C2 BF, C2 BF, C2 BF
00C0, C3 80, C3 80, C3 80, C3 80
00FF, C3 BF, C3 BF, C3 BF, C3 BF
0100, C4 80, C4 80, C4 80, C4 80
3FFF, E3 BF BF, E3 BF BF, E3 BF BF, E3 BF BF
4000, E4 80 80, E4 80 80, E4 80 80, E4 80 80
7FFF, E7 BF BF, E7 BF BF, E7 BF BF, E7 BF BF
8000, E8 80 80, E8 80 80, E8 80 80, E8 80 80
BFFF, EB BF BF, EB BF BF, EB BF BF, EB BF BF
C000, EC 80 80, EC 80 80, EC 80 80, EC 80 80
EFFF, EE BF BF, EE BF BF, EE BF BF, EE BF BF
F000, EF 80 80, EF 80 80, EF 80 80, EF 80 80
FFFF, EF BF BF, EF BF BF, EF BF BF, EF BF BF

The output matches my expectation.

Let's try another Unicode related encoding, UTF-16:

UTF-16 encoding:
Char, String, Writer, Charset, Encoder
0000, FE FF 00 00, FE FF 00 00, FE FF 00 00, FE FF 00 00
003F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F
0040, FE FF 00 40, FE FF 00 40, FE FF 00 40, FE FF 00 40
007F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F
0080, FE FF 00 80, FE FF 00 80, FE FF 00 80, FE FF 00 80
00BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF
00C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0
00FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF
0100, FE FF 01 00, FE FF 01 00, FE FF 01 00, FE FF 01 00
3FFF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF
4000, FE FF 40 00, FE FF 40 00, FE FF 40 00, FE FF 40 00
7FFF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF
8000, FE FF 80 00, FE FF 80 00, FE FF 80 00, FE FF 80 00
BFFF, FE FF BF FF, FE FF BF FF, FE FF BF FF, FE FF BF FF
C000, FE FF C0 00, FE FF C0 00, FE FF C0 00, FE FF C0 00
EFFF, FE FF EF FF, FE FF EF FF, FE FF EF FF, FE FF EF FF
F000, FE FF F0 00, FE FF F0 00, FE FF F0 00, FE FF F0 00
FFFF, FE FF FF FF, FE FF FF FF, FE FF FF FF, FE FF FF FF

This is a surprise to me. Why UTF-16 generates 32-bit sequences? Why not call it UTF32? I found the answer later: 0xFEFF is a BOM (Byte Order Mark) indicates that the following byte sequence is in Big Endian format. In other word, JDK uses the Big-Endian with BOM format for UTF-16 encoding by default.

How about UTF16-BE encoding:

UTF-16BE encoding:
Char, String, Writer, Charset, Encoder
0000, 00 00, 00 00, 00 00, 00 00
003F, 00 3F, 00 3F, 00 3F, 00 3F
0040, 00 40, 00 40, 00 40, 00 40
007F, 00 7F, 00 7F, 00 7F, 00 7F
0080, 00 80, 00 80, 00 80, 00 80
00BF, 00 BF, 00 BF, 00 BF, 00 BF
00C0, 00 C0, 00 C0, 00 C0, 00 C0
00FF, 00 FF, 00 FF, 00 FF, 00 FF
0100, 01 00, 01 00, 01 00, 01 00
3FFF, 3F FF, 3F FF, 3F FF, 3F FF
4000, 40 00, 40 00, 40 00, 40 00
7FFF, 7F FF, 7F FF, 7F FF, 7F FF
8000, 80 00, 80 00, 80 00, 80 00
BFFF, BF FF, BF FF, BF FF, BF FF
C000, C0 00, C0 00, C0 00, C0 00
EFFF, EF FF, EF FF, EF FF, EF FF
F000, F0 00, F0 00, F0 00, F0 00
FFFF, FF FF, FF FF, FF FF, FF FF

The output is perfect.

Sections in This Chapter

What Is Character Encoding?

Supported Character Encodings in JDK 1.4.1

EncodingSampler.java - Testing encode() Methods

Examples of CP1252 and ISO-8859-1 Encodings

Examples of US-ASCII, UTF-8, UTF-16 and UTF-16BE Encodings

Examples of GB18030 Encoding

Testing decode() Methods

Dr. Herong Yang, updated in 2009
Examples of US-ASCII, UTF-8, UTF-16 and UTF-16BE Encodings