Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

JDK - Character Set and Encoding

(Continued from previous part...)

Note that:

  • If the same encoding is used, every encode method in the program should return exactly the same byte sequence.
  • getEncoding() is called on the OutputStreamWriter class to get the name of the default encoding.
  • There is no way to get the name of the default encoding from the String class.
  • There is no default instance of Charset and Encoder.
  • In encodeByEncoder(), 0x00 is used as the output when the given character cannot be encoded by the encoder.
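The last note explains why the Encoder column in the outputs differs from the other columns for non-encodable characters. The following is a minimal sketch of that behavior, not the author's program: the class name EncoderFallbackSketch, the helper encodeByString(), and the body of encodeByEncoder() shown here are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical class for illustration; not part of the original program.
class EncoderFallbackSketch {

    // Encode one character with a CharsetEncoder. A single 0x00 byte is
    // returned when the encoder reports that the character is not encodable.
    static byte[] encodeByEncoder(char c, Charset cs) {
        CharsetEncoder encoder = cs.newEncoder();
        if (!encoder.canEncode(c)) {
            return new byte[] { 0x00 };
        }
        try {
            ByteBuffer bb = encoder.encode(CharBuffer.wrap(new char[] { c }));
            byte[] out = new byte[bb.remaining()];
            bb.get(out);
            return out;
        } catch (CharacterCodingException e) {
            return new byte[] { 0x00 };
        }
    }

    // String.getBytes() substitutes the charset's replacement byte instead,
    // which is 0x3F ('?') for the single-byte encodings used in these notes.
    static byte[] encodeByString(char c, Charset cs) {
        return String.valueOf(c).getBytes(cs);
    }

    public static void main(String[] args) {
        Charset cs = Charset.forName("US-ASCII");
        for (char c : new char[] { '\u0040', '\u00FF' }) {
            System.out.printf("%04X, %02X, %02X%n", (int) c,
                encodeByString(c, cs)[0] & 0xFF,
                encodeByEncoder(c, cs)[0] & 0xFF);
        }
    }
}
```

With US-ASCII, the sketch prints 40 in both columns for 0x0040, and 3F followed by 00 for 0x00FF.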

Running this program without any argument will use the JVM's default encoding:

Default (Cp1252) encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

The results show that:

  • The default encoding of the String class seems to be the same as OutputStreamWriter: Cp1252.
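  • There are a number of characters that cannot be encoded by Cp1252. The String, OutputStreamWriter, and Charset classes return 0x3F for those non-encodable characters.
  • Among the characters tested, Cp1252 encodes only those in the 0x0000 - 0x00FF range, and not all of them: 0x0080 is not encodable. The test does not cover characters above 0x00FF that Cp1252 does encode, such as 0x20AC (the euro sign), which maps to byte 0x80.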
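The first observation, that String and OutputStreamWriter share one default encoding, can be checked without reading the output table. The sketch below assumes that the name returned by getEncoding() is accepted by Charset.forName(), which holds for the common encodings; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical class for illustration; not part of the original program.
class DefaultEncodingSketch {

    // Name of the default encoding, as reported by OutputStreamWriter.
    static String writerDefaultEncoding() {
        OutputStreamWriter writer =
            new OutputStreamWriter(new ByteArrayOutputStream());
        return writer.getEncoding();
    }

    // True if String.getBytes() with no argument produces the same bytes as
    // encoding explicitly with the encoding name reported by the writer.
    static boolean stringUsesWriterDefault(String sample) {
        Charset reported = Charset.forName(writerDefaultEncoding());
        byte[] implicit = sample.getBytes();
        byte[] explicit = sample.getBytes(reported);
        return Arrays.equals(implicit, explicit);
    }

    public static void main(String[] args) {
        // One ASCII character, one Latin-1 character, one character above 0x00FF
        String sample = "A\u00C0\u0100";
        System.out.println("Writer default: " + writerDefaultEncoding());
        System.out.println("String matches: " + stringUsesWriterDefault(sample));
    }
}
```

The reported name depends on the JVM and platform, so it will not necessarily be Cp1252 as in the run above.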

Running the program again with 'CP1252' as argument should give us the same output as the previous run:

CP1252 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

Let's try another encoding, ISO-8859-1:

ISO-8859-1 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 80, 80, 80, 80
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

It is almost the same as CP1252. The only difference in this test is 0x0080, which ISO-8859-1 encodes as 0x80 but CP1252 cannot encode.
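The 0x0080 row is the one place where the ISO-8859-1 and CP1252 outputs differ, and it can be probed directly together with a character outside the test set. This is a small sketch under assumptions: the class and helper names are hypothetical, and "windows-1252" is used as the charset name, for which Java accepts Cp1252 as an alias.

```java
import java.nio.charset.Charset;

// Hypothetical class for illustration; not part of the original program.
class Latin1VsCp1252Sketch {

    // Encode one character and return the first output byte as 0..255.
    // Non-encodable characters come back as 0x3F ('?').
    static int encodeToByte(char c, String charsetName) {
        byte[] bytes = String.valueOf(c).getBytes(Charset.forName(charsetName));
        return bytes[0] & 0xFF;
    }

    public static void main(String[] args) {
        String[] names = { "ISO-8859-1", "windows-1252" };
        // 0x0080 is a control character; 0x20AC is the euro sign.
        char[] chars = { '\u0080', '\u20AC' };
        for (String name : names) {
            for (char c : chars) {
                System.out.printf("%s: %04X -> %02X%n",
                    name, (int) c, encodeToByte(c, name));
            }
        }
    }
}
```

ISO-8859-1 yields 80 for 0x0080 and 3F for 0x20AC, while windows-1252 yields the reverse: 3F for 0x0080 and 80 for 0x20AC.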

Let's try another one, US-ASCII:

US-ASCII encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, 3F, 3F, 3F, 00
00C0, 3F, 3F, 3F, 00
00FF, 3F, 3F, 3F, 00
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

The output shows that US-ASCII encodes only characters in the 0x0000 - 0x007F range. Every tested character above 0x007F is not encodable.
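The tables above sample only 18 characters, so a range inferred from them is an estimate. One way to confirm it is to ask an encoder about every char value, as in this sketch. The class and method names are hypothetical, and CharsetEncoder.canEncode() is used in place of the four encode methods of the test program.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical class for illustration; not part of the original program.
class EncodableRangeSketch {

    // Number of char values in 0x0000 - 0xFFFF that the charset can encode.
    static int countEncodable(String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int count = 0;
        for (int i = 0; i <= 0xFFFF; i++) {
            if (encoder.canEncode((char) i)) {
                count++;
            }
        }
        return count;
    }

    // Highest encodable char value, or -1 if nothing can be encoded.
    static int highestEncodable(String charsetName) {
        CharsetEncoder encoder = Charset.forName(charsetName).newEncoder();
        int highest = -1;
        for (int i = 0; i <= 0xFFFF; i++) {
            if (encoder.canEncode((char) i)) {
                highest = i;
            }
        }
        return highest;
    }

    public static void main(String[] args) {
        for (String name : new String[] { "US-ASCII", "ISO-8859-1" }) {
            System.out.printf("%s: %d encodable, highest %04X%n",
                name, countEncodable(name), highestEncodable(name));
        }
    }
}
```

For US-ASCII the scan finds 128 encodable characters with 0x007F as the highest. For ISO-8859-1 it finds 256 with 0x00FF as the highest.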

(Continued on next part...)

Dr. Herong Yang, updated in 2007