Examples of US-ASCII, UTF-8, UTF-16 and UTF-16BE Encodings
This section provides examples of encoded byte sequences of US-ASCII, UTF-8, UTF-16 and UTF-16BE encodings.
Let's continue to play with the testing program, EncodingSampler.java, provided in the previous section. This time, I want to try US-ASCII encoding:
C:\herong>java EncodingSampler US-ASCII
US-ASCII encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, 3F, 3F, 3F, 00
00C0, 3F, 3F, 3F, 00
00FF, 3F, 3F, 3F, 00
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00
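The output shows that US-ASCII covers only the characters in the 0x0000 - 0x007F range. For any character above 0x007F, the String, Writer and Charset methods output 0x3F, which is the code of the question mark "?", and the Encoder method outputs 0x00.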
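To see the "?" substitution outside of EncodingSampler.java, here is a minimal sketch of my own. The class name AsciiReplacementSketch and its helper methods are hypothetical names that are not part of the tutorial's program, and StandardCharsets assumes Java 7 or higher. It relies on String.getBytes(Charset), which replaces unmappable characters with the charset's replacement byte, and on CharsetEncoder.canEncode() to ask whether a character is covered at all.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of EncodingSampler.java.
public class AsciiReplacementSketch {
    // Formats bytes as space-separated uppercase hex, e.g. "3F".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes one char with String.getBytes(), which substitutes the
    // charset's replacement byte for characters it cannot map.
    static String encodeAscii(char c) {
        return toHex(String.valueOf(c).getBytes(StandardCharsets.US_ASCII));
    }

    // Asks a fresh US-ASCII encoder whether it can represent the char.
    static boolean isAscii(char c) {
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder();
        return encoder.canEncode(c);
    }

    public static void main(String[] args) {
        char[] samples = {0x003F, 0x007F, 0x0080, 0x4000};
        for (char c : samples) {
            System.out.printf("%04X -> %s (encodable: %b)%n",
                (int) c, encodeAscii(c), isAscii(c));
        }
    }
}
```

With this sketch, 0x007F should come out as 7F, while 0x0080 and 0x4000 should both come out as 3F and be reported as not encodable.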
Now I want to try UTF-8 encoding to cover characters with code points higher than 0x007F:
C:\herong>java EncodingSampler UTF-8
UTF-8 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, C2 80, C2 80, C2 80, C2 80
00BF, C2 BF, C2 BF, C2 BF, C2 BF
00C0, C3 80, C3 80, C3 80, C3 80
00FF, C3 BF, C3 BF, C3 BF, C3 BF
0100, C4 80, C4 80, C4 80, C4 80
3FFF, E3 BF BF, E3 BF BF, E3 BF BF, E3 BF BF
4000, E4 80 80, E4 80 80, E4 80 80, E4 80 80
7FFF, E7 BF BF, E7 BF BF, E7 BF BF, E7 BF BF
8000, E8 80 80, E8 80 80, E8 80 80, E8 80 80
BFFF, EB BF BF, EB BF BF, EB BF BF, EB BF BF
C000, EC 80 80, EC 80 80, EC 80 80, EC 80 80
EFFF, EE BF BF, EE BF BF, EE BF BF, EE BF BF
F000, EF 80 80, EF 80 80, EF 80 80, EF 80 80
FFFF, EF BF BF, EF BF BF, EF BF BF, EF BF BF
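The output matches my expectation: UTF-8 uses 1 byte for characters in the 0x0000 - 0x007F range, 2 bytes for characters in the 0x0080 - 0x07FF range, and 3 bytes for characters in the 0x0800 - 0xFFFF range.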
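The byte values in the UTF-8 output can be reproduced by hand with a few shift and mask operations. The sketch below is only an illustration: Utf8LayoutSketch, encodeUtf8() and matchesJdk() are hypothetical names of my own, the hand-written encoder deliberately ignores surrogate characters (0xD800 - 0xDFFF), and StandardCharsets assumes Java 7 or higher.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of EncodingSampler.java.
public class Utf8LayoutSketch {
    // Formats bytes as space-separated uppercase hex, e.g. "C2 80".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Hand-written UTF-8 encoding of a single char.
    // Assumption: the char is not a surrogate (0xD800 - 0xDFFF).
    static byte[] encodeUtf8(char c) {
        if (c < 0x80) {
            // 1 byte: 0xxxxxxx
            return new byte[] {(byte) c};
        }
        if (c < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))
        };
    }

    // Compares the hand-written result with the JDK's UTF-8 encoder.
    static boolean matchesJdk(char c) {
        byte[] jdk = String.valueOf(c).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(jdk, encodeUtf8(c));
    }

    public static void main(String[] args) {
        char[] samples = {0x007F, 0x0080, 0x00FF, 0x0100, 0x3FFF, 0xFFFF};
        for (char c : samples) {
            System.out.printf("%04X -> %s (matches JDK: %b)%n",
                (int) c, toHex(encodeUtf8(c)), matchesJdk(c));
        }
    }
}
```

For example, 0x3FFF is 0011 1111 1111 1111 in binary. Splitting it into 4 + 6 + 6 bits gives 0011, 111111 and 111111, which become E3, BF and BF after the prefix bits are added.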
Let's try another Unicode related encoding, UTF-16:
C:\herong>java EncodingSampler UTF-16
UTF-16 encoding:
Char, String, Writer, Charset, Encoder
0000, FE FF 00 00, FE FF 00 00, FE FF 00 00, FE FF 00 00
003F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F, FE FF 00 3F
0040, FE FF 00 40, FE FF 00 40, FE FF 00 40, FE FF 00 40
007F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F, FE FF 00 7F
0080, FE FF 00 80, FE FF 00 80, FE FF 00 80, FE FF 00 80
00BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF, FE FF 00 BF
00C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0, FE FF 00 C0
00FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF, FE FF 00 FF
0100, FE FF 01 00, FE FF 01 00, FE FF 01 00, FE FF 01 00
3FFF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF, FE FF 3F FF
4000, FE FF 40 00, FE FF 40 00, FE FF 40 00, FE FF 40 00
7FFF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF, FE FF 7F FF
8000, FE FF 80 00, FE FF 80 00, FE FF 80 00, FE FF 80 00
BFFF, FE FF BF FF, FE FF BF FF, FE FF BF FF, FE FF BF FF
C000, FE FF C0 00, FE FF C0 00, FE FF C0 00, FE FF C0 00
EFFF, FE FF EF FF, FE FF EF FF, FE FF EF FF, FE FF EF FF
F000, FE FF F0 00, FE FF F0 00, FE FF F0 00, FE FF F0 00
FFFF, FE FF FF FF, FE FF FF FF, FE FF FF FF, FE FF FF FF
This was a surprise to me. Why does UTF-16 generate a 32-bit sequence for each character? Why not call it UTF-32? I found the answer later: the leading 0xFEFF is a BOM (Byte Order Mark), which indicates that the byte sequence that follows is in big-endian format. Each row shows a single character encoded on its own, so the BOM appears in every row. In other words, the JDK uses the big-endian with BOM format for UTF-16 encoding by default.
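A quick way to confirm that the extra 2 bytes are a BOM and not part of each character is to encode a string with more than one character. The sketch below is my own illustration: Utf16BomSketch and its methods are hypothetical names, and StandardCharsets assumes Java 7 or higher. It also decodes a little-endian byte sequence to show that the JDK's UTF-16 decoder reads the BOM to pick the byte order.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of EncodingSampler.java.
public class Utf16BomSketch {
    // Formats bytes as space-separated uppercase hex, e.g. "FE FF 00 41".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes text with the JDK's "UTF-16" charset (big-endian with BOM).
    static String encodeUtf16(String text) {
        return toHex(text.getBytes(StandardCharsets.UTF_16));
    }

    // Decodes bytes with "UTF-16"; a leading BOM selects the byte order.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // One character: BOM followed by 2 bytes.
        System.out.println(encodeUtf16("A"));   // FE FF 00 41
        // Two characters: still only one BOM at the start.
        System.out.println(encodeUtf16("AB"));  // FE FF 00 41 00 42

        // Little-endian input with its own BOM (FF FE) decodes correctly.
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        System.out.println(decodeUtf16(littleEndian)); // A
    }
}
```

The second line of output is the interesting one: the BOM is written once per encoded sequence, so a 2-character string takes 6 bytes, not 8.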
How about UTF-16BE encoding:
C:\herong>java EncodingSampler UTF-16BE
UTF-16BE encoding:
Char, String, Writer, Charset, Encoder
0000, 00 00, 00 00, 00 00, 00 00
003F, 00 3F, 00 3F, 00 3F, 00 3F
0040, 00 40, 00 40, 00 40, 00 40
007F, 00 7F, 00 7F, 00 7F, 00 7F
0080, 00 80, 00 80, 00 80, 00 80
00BF, 00 BF, 00 BF, 00 BF, 00 BF
00C0, 00 C0, 00 C0, 00 C0, 00 C0
00FF, 00 FF, 00 FF, 00 FF, 00 FF
0100, 01 00, 01 00, 01 00, 01 00
3FFF, 3F FF, 3F FF, 3F FF, 3F FF
4000, 40 00, 40 00, 40 00, 40 00
7FFF, 7F FF, 7F FF, 7F FF, 7F FF
8000, 80 00, 80 00, 80 00, 80 00
BFFF, BF FF, BF FF, BF FF, BF FF
C000, C0 00, C0 00, C0 00, C0 00
EFFF, EF FF, EF FF, EF FF, EF FF
F000, F0 00, F0 00, F0 00, F0 00
FFFF, FF FF, FF FF, FF FF, FF FF
The output is perfect: each character is encoded as exactly 2 bytes, high byte first, with no BOM.
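Since UTF-16BE simply writes the high byte and then the low byte of each 16-bit char, the rows above can be reproduced with one shift and one mask. The following sketch is my own illustration: Utf16ByteOrderSketch, encode() and splitBigEndian() are hypothetical names, and StandardCharsets assumes Java 7 or higher. It compares the hand-split bytes with the JDK's UTF-16BE output and prints UTF-16LE next to it for contrast.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of EncodingSampler.java.
public class Utf16ByteOrderSketch {
    // Formats bytes as space-separated uppercase hex, e.g. "3F FF".
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes one char with the given charset via String.getBytes().
    static String encode(char c, Charset charset) {
        return toHex(String.valueOf(c).getBytes(charset));
    }

    // Splits a char by hand: high byte first, then low byte.
    static String splitBigEndian(char c) {
        return toHex(new byte[] {(byte) (c >> 8), (byte) (c & 0xFF)});
    }

    public static void main(String[] args) {
        char[] samples = {0x0040, 0x0100, 0x3FFF, 0xF000};
        for (char c : samples) {
            System.out.printf("%04X -> BE: %s, by hand: %s, LE: %s%n",
                (int) c,
                encode(c, StandardCharsets.UTF_16BE),
                splitBigEndian(c),
                encode(c, StandardCharsets.UTF_16LE));
        }
    }
}
```

For 0x3FFF, the big-endian and hand-split results should both be 3F FF, while UTF-16LE swaps the two bytes to FF 3F. Neither variant writes a BOM.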
Last update: 2009.