Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

JDK - Character Set and Encoding

(Continued from previous part...)

Let's try an encoding that is designed for the Unicode character set, UTF-8:

UTF-8 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, C2 80, C2 80, C2 80, C2 80
00BF, C2 BF, C2 BF, C2 BF, C2 BF
00C0, C3 80, C3 80, C3 80, C3 80
00FF, C3 BF, C3 BF, C3 BF, C3 BF
0100, C4 80, C4 80, C4 80, C4 80
3FFF, E3 BF BF, E3 BF BF, E3 BF BF, E3 BF BF
4000, E4 80 80, E4 80 80, E4 80 80, E4 80 80
7FFF, E7 BF BF, E7 BF BF, E7 BF BF, E7 BF BF
8000, E8 80 80, E8 80 80, E8 80 80, E8 80 80
BFFF, EB BF BF, EB BF BF, EB BF BF, EB BF BF
C000, EC 80 80, EC 80 80, EC 80 80, EC 80 80
EFFF, EE BF BF, EE BF BF, EE BF BF, EE BF BF
F000, EF 80 80, EF 80 80, EF 80 80, EF 80 80
FFFF, EF BF BF, EF BF BF, EF BF BF, EF BF BF

UTF-8 generates variable-length byte sequences: one byte (8 bits) for 0000 through 007F, two bytes for 0080 through 07FF, and three bytes for 0800 through FFFF.
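
The length boundaries can be checked directly. The following is a minimal sketch, not the program used to produce the table above; the class name Utf8LengthDemo and its helper methods are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    // Format a byte array as space-separated uppercase hex, e.g. "C2 80".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode a single char with UTF-8 through String.getBytes().
    static byte[] encodeUtf8(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Characters on both sides of the 1/2-byte and 2/3-byte boundaries.
        char[] samples = {0x007F, 0x0080, 0x07FF, 0x0800, 0xFFFF};
        for (char c : samples) {
            byte[] bytes = encodeUtf8(c);
            System.out.println(String.format("%04X", (int) c)
                + ", " + bytes.length + " byte(s), " + toHex(bytes));
        }
    }
}
```

Surrogate characters (D800 through DFFF) are deliberately left out of the samples, because a lone surrogate is not a complete character and cannot be encoded on its own.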

Let's try another Unicode-related encoding, UTF-16:

UTF-16 encoding:
Char, String, Writer, Charset, Encoder
0000, FE FF 00 00, FE FF 00 00, FE FF 00 00, FE FF 00 00
003F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F
0040, FE FF 00 40, FE FF 00 40, FE FF 00 40, FE FF 00 40
007F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F
0080, FE FF 00 80, FE FF 00 80, FE FF 00 80, FE FF 00 80
00BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF
00C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0
00FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF
0100, FE FF 01 00, FE FF 01 00, FE FF 01 00, FE FF 01 00
3FFF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF
4000, FE FF 40 00, FE FF 40 00, FE FF 40 00, FE FF 40 00
7FFF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF
8000, FE FF 80 00, FE FF 80 00, FE FF 80 00, FE FF 80 00
BFFF, FE FF BF FF, FE FF BF FF, FE FF BF FF, FE FF BF FF
C000, FE FF C0 00, FE FF C0 00, FE FF C0 00, FE FF C0 00
EFFF, FE FF EF FF, FE FF EF FF, FE FF EF FF, FE FF EF FF
F000, FE FF F0 00, FE FF F0 00, FE FF F0 00, FE FF F0 00
FFFF, FE FF FF FF, FE FF FF FF, FE FF FF FF, FE FF FF FF

This was a surprise to me. Why does UTF-16 generate 32-bit sequences? Why not call it UTF-32? I found the answer later on: 0xFEFF is a byte order mark (BOM), a flag indicating that the byte sequence that follows is in UTF-16BE (Big Endian) format.
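
The 0xFEFF flag is written once per encoded sequence, not once per character. Every row in the table starts with FE FF presumably because each character was encoded on its own. The sketch below illustrates this with a two-character string; the class name Utf16BomDemo and its helpers are made up for illustration, and StandardCharsets assumes Java 7 or later.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16BomDemo {
    // Format a byte array as space-separated uppercase hex.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode a string with the given charset and return the bytes as hex.
    static String hexOf(String s, Charset cs) {
        return toHex(s.getBytes(cs));
    }

    public static void main(String[] args) {
        // "UTF-16" prepends the flag FE FF, then writes big-endian code units.
        System.out.println(hexOf("@", StandardCharsets.UTF_16));    // FE FF 00 40
        // The flag appears only once, even for more than one character.
        System.out.println(hexOf("@A", StandardCharsets.UTF_16));   // FE FF 00 40 00 41
        // The explicit-endian variants write no flag at all.
        System.out.println(hexOf("@", StandardCharsets.UTF_16BE));  // 00 40
        System.out.println(hexOf("@", StandardCharsets.UTF_16LE));  // 40 00
    }
}
```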

How about the UTF-16BE encoding:

UTF-16BE encoding:
Char, String, Writer, Charset, Encoder
0000, 00 00, 00 00, 00 00, 00 00
003F, 00 3F, 00 3F, 00 3F, 00 3F
0040, 00 40, 00 40, 00 40, 00 40
007F, 00 7F, 00 7F, 00 7F, 00 7F
0080, 00 80, 00 80, 00 80, 00 80
00BF, 00 BF, 00 BF, 00 BF, 00 BF
00C0, 00 C0, 00 C0, 00 C0, 00 C0
00FF, 00 FF, 00 FF, 00 FF, 00 FF
0100, 01 00, 01 00, 01 00, 01 00
3FFF, 3F FF, 3F FF, 3F FF, 3F FF
4000, 40 00, 40 00, 40 00, 40 00
7FFF, 7F FF, 7F FF, 7F FF, 7F FF
8000, 80 00, 80 00, 80 00, 80 00
BFFF, BF FF, BF FF, BF FF, BF FF
C000, C0 00, C0 00, C0 00, C0 00
EFFF, EF FF, EF FF, EF FF, EF FF
F000, F0 00, F0 00, F0 00, F0 00
FFFF, FF FF, FF FF, FF FF, FF FF

This seems to be the perfect encoding: the output bytes are identical to the 16-bit values of the input characters.
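
The match between input and output can be checked in code by joining the two output bytes back into a 16-bit number and comparing it with the numeric value of the char. This is only a sketch; the class name Utf16BeDemo and the method bigEndianValue are made up for illustration, and the sample values are taken from the table above.

```java
import java.nio.charset.StandardCharsets;

class Utf16BeDemo {
    // Encode one char with UTF-16BE and join the two bytes,
    // high byte first, into a single 16-bit value.
    static int bigEndianValue(char c) {
        byte[] b = String.valueOf(c).getBytes(StandardCharsets.UTF_16BE);
        return ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
    }

    public static void main(String[] args) {
        char[] samples = {0x0000, 0x00FF, 0x0100, 0x7FFF, 0x8000, 0xFFFF};
        for (char c : samples) {
            int value = bigEndianValue(c);
            System.out.println(String.format("%04X -> %04X, same: %b",
                (int) c, value, value == c));
        }
    }
}
```

Lone surrogate characters (D800 through DFFF) are not included in the samples, since they cannot be encoded on their own.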

Let's try an encoding related to Chinese characters, GB18030:

GB18030 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 81 30 81 30, 81 30 81 30, 81 30 81 30, 81 30 81 30
00BF, 81 30 86 37, 81 30 86 37, 81 30 86 37, 81 30 86 37
00C0, 81 30 86 38, 81 30 86 38, 81 30 86 38, 81 30 86 38
00FF, 81 30 8B 37, 81 30 8B 37, 81 30 8B 37, 81 30 8B 37
0100, 81 30 8B 38, 81 30 8B 38, 81 30 8B 38, 81 30 8B 38
3FFF, 82 32 A6 36, 82 32 A6 36, 82 32 A6 36, 82 32 A6 36
4000, 82 32 A6 37, 82 32 A6 37, 82 32 A6 37, 82 32 A6 37
7FFF, C2 52, C2 52, C2 52, C2 52
8000, D2 AB, D2 AB, D2 AB, D2 AB
BFFF, 83 31 D7 34, 83 31 D7 34, 83 31 D7 34, 83 31 D7 34
C000, 83 31 D7 35, 83 31 D7 35, 83 31 D7 35, 83 31 D7 35
EFFF, 83 38 96 36, 83 38 96 36, 83 38 96 36, 83 38 96 36
F000, 83 38 96 37, 83 38 96 37, 83 38 96 37, 83 38 96 37
FFFF, 84 31 A4 39, 84 31 A4 39, 84 31 A4 39, 84 31 A4 39

It looks complicated: the output is 1, 2, or 4 bytes long, depending on the character.
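
The different output lengths in the GB18030 table can be reproduced with a few lines. This is a hedged sketch: the class name Gb18030LengthDemo and the method encodedLength are made up for illustration, and the code assumes that the Java runtime in use ships the GB18030 charset, which it checks before encoding.

```java
import java.nio.charset.Charset;

class Gb18030LengthDemo {
    // Number of bytes produced when one char is encoded with the named charset.
    static int encodedLength(char c, String charsetName) {
        return String.valueOf(c).getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String name = "GB18030";
        // Some reduced runtimes may not include this charset.
        if (!Charset.isSupported(name)) {
            System.out.println(name + " is not available in this runtime");
            return;
        }
        // Sample values taken from the table above.
        char[] samples = {0x0040, 0x0080, 0x7FFF, 0xFFFF};
        for (char c : samples) {
            System.out.println(String.format("%04X", (int) c)
                + ", " + encodedLength(c, name) + " byte(s)");
        }
    }
}
```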

I think that's enough. You can run the program with any of the supported encodings as an argument yourself.
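
To find out which encoding names are supported on a given system, the Charset class can be queried. The sketch below is separate from the test program; the class name CharsetSupportDemo and the method isUsable are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class CharsetSupportDemo {
    // True if the name is a legal charset name and this runtime supports it.
    static boolean isUsable(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            // Thrown for names with illegal characters, such as spaces.
            return false;
        }
    }

    public static void main(String[] args) {
        if (args.length > 0) {
            // Check the encoding name given on the command line.
            System.out.println(args[0] + " supported: " + isUsable(args[0]));
        } else {
            // Without an argument, list the canonical name of every charset.
            for (String name : Charset.availableCharsets().keySet()) {
                System.out.println(name);
            }
        }
    }
}
```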

Methods to Decode Byte Sequences

There are 4 methods to decode byte sequences into characters:

  • CharsetDecoder.decode()
  • Charset.decode()
  • new String()
  • InputStreamReader.read()

These methods are used in much the same way as the corresponding encode methods.
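
As a rough illustration, the four methods listed above can be applied to the same input, here the UTF-8 bytes C3 80 from the first table. This is a sketch only; the class name DecodeDemo and the helper method names are made up for illustration, and StandardCharsets and try-with-resources assume Java 7 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // 1. CharsetDecoder.decode(): reports malformed input as an exception.
    static String viaDecoder(byte[] bytes, Charset cs) {
        try {
            return cs.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // 2. Charset.decode(): replaces malformed input instead of failing.
    static String viaCharset(byte[] bytes, Charset cs) {
        return cs.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // 3. new String(): the shortest form when the bytes are already in an array.
    static String viaString(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    // 4. InputStreamReader.read(): decodes while reading from a stream.
    static String viaReader(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (InputStreamReader reader =
                 new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int ch;
            while ((ch = reader.read()) != -1) {
                sb.append((char) ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0x80}; // UTF-8 bytes of 00C0
        Charset cs = StandardCharsets.UTF_8;
        String[] results = {
            viaDecoder(bytes, cs), viaCharset(bytes, cs),
            viaString(bytes, cs), viaReader(bytes, cs)
        };
        for (String s : results) {
            System.out.println(String.format("%04X", (int) s.charAt(0)));
        }
    }
}
```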

Exercise: Find out what the default 'Charset' used by the 'String' class is.

Source: Herong's Notes on JDK.

Dr. Herong Yang, updated in 2007