JDK (Java Development Kit) Tutorials
Dr. Herong Yang, Version 5.00

Running EncodingSampler.java with UTF-8, UTF-16, UTF-16BE

This section provides a tutorial example on how to run the character encoding sample program with the UTF-8, UTF-16, and UTF-16BE encodings, which are all encodings of the Unicode character set.

I think we are ready to try an encoding that is designed for the Unicode character set, UTF-8:

UTF-8 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, C2 80, C2 80, C2 80, C2 80
00BF, C2 BF, C2 BF, C2 BF, C2 BF
00C0, C3 80, C3 80, C3 80, C3 80
00FF, C3 BF, C3 BF, C3 BF, C3 BF
0100, C4 80, C4 80, C4 80, C4 80
3FFF, E3 BF BF, E3 BF BF, E3 BF BF, E3 BF BF
4000, E4 80 80, E4 80 80, E4 80 80, E4 80 80
7FFF, E7 BF BF, E7 BF BF, E7 BF BF, E7 BF BF
8000, E8 80 80, E8 80 80, E8 80 80, E8 80 80
BFFF, EB BF BF, EB BF BF, EB BF BF, EB BF BF
C000, EC 80 80, EC 80 80, EC 80 80, EC 80 80
EFFF, EE BF BF, EE BF BF, EE BF BF, EE BF BF
F000, EF 80 80, EF 80 80, EF 80 80, EF 80 80
FFFF, EF BF BF, EF BF BF, EF BF BF, EF BF BF

UTF-8 generates byte sequences of variable length: one byte (8 bits) for 0000 to 007F, two bytes for 0080 to 07FF, and three bytes for 0800 to FFFF.
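
To see where the sequence length changes, here is a minimal sketch that encodes the characters on each side of a length boundary. It is not part of EncodingSampler.java: the class name Utf8LengthDemo and the helpers toHex() and encode() are hypothetical names used only for illustration, and the code assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {

   // Formats bytes as space-separated uppercase hex, like the listing above
   static String toHex(byte[] bytes) {
      StringBuilder sb = new StringBuilder();
      for (byte b : bytes) {
         if (sb.length() > 0) sb.append(' ');
         sb.append(String.format("%02X", b & 0xFF));
      }
      return sb.toString();
   }

   // Encodes a single character with UTF-8
   static byte[] encode(char c) {
      return String.valueOf(c).getBytes(StandardCharsets.UTF_8);
   }

   public static void main(String[] a) {
      // Characters just below and just above each length boundary
      char[] samples = {0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF};
      for (char c : samples) {
         byte[] b = encode(c);
         System.out.println(String.format("%04X", (int) c)
            + ", " + b.length + " byte(s), " + toHex(b));
      }
   }
}
```

The sequence length should step from 1 to 2 bytes between 007F and 0080, and from 2 to 3 bytes between 07FF and 0800.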

The second test is for another Unicode-related encoding, UTF-16:

UTF-16 encoding:
Char, String, Writer, Charset, Encoder
0000, FE FF 00 00, FE FF 00 00, FE FF 00 00, FE FF 00 00
003F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F
0040, FE FF 00 40, FE FF 00 40, FE FF 00 40, FE FF 00 40
007F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F
0080, FE FF 00 80, FE FF 00 80, FE FF 00 80, FE FF 00 80
00BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF
00C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0
00FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF
0100, FE FF 01 00, FE FF 01 00, FE FF 01 00, FE FF 01 00
3FFF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF
4000, FE FF 40 00, FE FF 40 00, FE FF 40 00, FE FF 40 00
7FFF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF
8000, FE FF 80 00, FE FF 80 00, FE FF 80 00, FE FF 80 00
BFFF, FE FF BF FF, FE FF BF FF, FE FF BF FF, FE FF BF FF
C000, FE FF C0 00, FE FF C0 00, FE FF C0 00, FE FF C0 00
EFFF, FE FF EF FF, FE FF EF FF, FE FF EF FF, FE FF EF FF
F000, FE FF F0 00, FE FF F0 00, FE FF F0 00, FE FF F0 00
FFFF, FE FF FF FF, FE FF FF FF, FE FF FF FF, FE FF FF FF

This was a surprise to me. Why does UTF-16 generate 32-bit sequences? Why not call it UTF-32?

I found the answer later: the first 16 bits, 0xFEFF, are not part of the encoded character. They are a byte order mark (BOM), a flag indicating that the byte sequence that follows is in UTF-16BE (Big Endian) format.
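
The effect of the leading flag can be checked with a small sketch that encodes the same character with UTF-16, UTF-16BE and UTF-16LE, then decodes a little-endian sequence through the UTF-16 charset. The class name Utf16BomDemo and the helper toHex() are hypothetical names, not part of EncodingSampler.java, and the code assumes Java 7 or later for StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {

   // Formats bytes as space-separated uppercase hex
   static String toHex(byte[] bytes) {
      StringBuilder sb = new StringBuilder();
      for (byte b : bytes) {
         if (sb.length() > 0) sb.append(' ');
         sb.append(String.format("%02X", b & 0xFF));
      }
      return sb.toString();
   }

   public static void main(String[] a) {
      String s = "\u0100";

      // UTF-16 writes the flag FE FF first, then big-endian bytes
      System.out.println("UTF-16:   "
         + toHex(s.getBytes(StandardCharsets.UTF_16)));

      // UTF-16BE and UTF-16LE fix the byte order, so no flag is written
      System.out.println("UTF-16BE: "
         + toHex(s.getBytes(StandardCharsets.UTF_16BE)));
      System.out.println("UTF-16LE: "
         + toHex(s.getBytes(StandardCharsets.UTF_16LE)));

      // When decoding, the UTF-16 charset reads the flag to pick the
      // byte order: FF FE means the following bytes are little-endian
      byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x01};
      String decoded = new String(littleEndian, StandardCharsets.UTF_16);
      System.out.println("Decoded matches: " + decoded.equals(s));
   }
}
```

Only the UTF-16 line should start with FE FF. The decoding step shows why the flag is useful: a reader that does not know the byte order in advance can learn it from the first two bytes.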

Here is the result of the third test on another Unicode encoding, UTF-16BE:

UTF-16BE encoding:
Char, String, Writer, Charset, Encoder
0000, 00 00, 00 00, 00 00, 00 00
003F, 00 3F, 00 3F, 00 3F, 00 3F
0040, 00 40, 00 40, 00 40, 00 40
007F, 00 7F, 00 7F, 00 7F, 00 7F
0080, 00 80, 00 80, 00 80, 00 80
00BF, 00 BF, 00 BF, 00 BF, 00 BF
00C0, 00 C0, 00 C0, 00 C0, 00 C0
00FF, 00 FF, 00 FF, 00 FF, 00 FF
0100, 01 00, 01 00, 01 00, 01 00
3FFF, 3F FF, 3F FF, 3F FF, 3F FF
4000, 40 00, 40 00, 40 00, 40 00
7FFF, 7F FF, 7F FF, 7F FF, 7F FF
8000, 80 00, 80 00, 80 00, 80 00
BFFF, BF FF, BF FF, BF FF, BF FF
C000, C0 00, C0 00, C0 00, C0 00
EFFF, EF FF, EF FF, EF FF, EF FF
F000, F0 00, F0 00, F0 00, F0 00
FFFF, FF FF, FF FF, FF FF, FF FF

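This seems to be the perfect encoding: for every character in the sample, the output bytes are identical to the 16-bit character code, high byte first, with no byte order mark in front.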
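
One way to test this observation beyond the 18 sample characters is to split each 16-bit char value into a high byte and a low byte, then compare the result with what UTF-16BE produces. The following is a minimal sketch, not part of EncodingSampler.java: Utf16BeCheck, split() and countMismatches() are hypothetical names, and the code assumes Java 7 or later. Surrogate values (D800 to DFFF) are skipped, because a lone surrogate is not a complete character and is not encoded as-is.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16BeCheck {

   // Splits a 16-bit char value into its high byte and low byte
   static byte[] split(char c) {
      return new byte[] {(byte) (c >> 8), (byte) c};
   }

   // Counts non-surrogate char values whose UTF-16BE bytes
   // differ from the plain high-byte, low-byte split
   static int countMismatches() {
      int mismatches = 0;
      for (int i = 0; i <= 0xFFFF; i++) {
         char c = (char) i;
         if (Character.isSurrogate(c)) continue;
         byte[] encoded =
            String.valueOf(c).getBytes(StandardCharsets.UTF_16BE);
         if (!Arrays.equals(encoded, split(c))) mismatches++;
      }
      return mismatches;
   }

   public static void main(String[] a) {
      System.out.println("Mismatches: " + countMismatches());
   }
}
```

If the observation holds for all non-surrogate char values, the program reports no mismatches.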

Last update: 2006.

Dr. Herong Yang, updated in 2008