Herong's Tutorial Notes on Unicode
Dr. Herong Yang, Version 4.02

JDK - Character Set and Encoding

This chapter helps you understand:

  • What is a Character Encoding
  • Supported Character Encodings
  • Methods to Encode Characters
  • Methods to Decode Byte Sequences

Notes and sample code below are based on J2SDK 1.4.1_01.

What is a Character Encoding

Character Encoding: A mapping scheme between code points of a coded character set and sequences of bytes.

Coded Character Set: A character set in which each character has an assigned integral number.

Code Point: An integral number assigned to a character in a coded character set.

Unicode: A coded character set that aims to cover the characters used in all written languages of the world, plus special symbols.
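
To make these definitions concrete, the sketch below takes one character, prints its code point, and shows the byte sequences that three different encodings map it to. The class name CodePointToBytes and its helper methods are hypothetical names chosen for this illustration, and the code assumes a newer JDK than the 1.4.1 release used elsewhere in these notes (it relies on String.format() and String.getBytes(Charset)).

```java
import java.nio.charset.Charset;

public class CodePointToBytes {
    // Formats a byte array as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text with the named charset and returns the bytes as hex.
    static String encodeHex(String text, String charsetName) {
        return toHex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // LATIN SMALL LETTER E WITH ACUTE: one character, one code point.
        String s = "\u00E9";
        System.out.println("Code point: U+"
            + String.format("%04X", (int) s.charAt(0)));

        // The same code point maps to different byte sequences
        // depending on the character encoding.
        String[] names = {"ISO-8859-1", "UTF-8", "UTF-16BE"};
        for (String n : names) {
            System.out.println(n + ": " + encodeHex(s, n));
        }
    }
}
```

The code point stays the same (U+00E9) in all three cases; only the byte sequence changes: E9 in ISO-8859-1, C3 A9 in UTF-8, and 00 E9 in UTF-16BE.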

According to the reference documentation of the java.lang.Character class, J2SDK 1.4.1 supports Unicode 3.0.

I am not sure how JDK is going to support Unicode 3.1, because it contains characters with code points greater than U+FFFF, which is the maximum value of the 'char' type in Java.

Because of this 'char' limitation, J2SDK 1.4.1 can only support encoding and decoding code points in the 16-bit range: U+0000 - U+FFFF.
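
The following sketch illustrates why a code point above U+FFFF does not fit in a single 'char'. It does not describe J2SDK 1.4.1 behavior: it assumes J2SE 5.0 or later, where methods such as Character.toChars() and String.codePointCount() are available and such a code point is stored as a pair of surrogate 'char' values. The class name SupplementaryDemo and the helper charUnits() are hypothetical names used only for this illustration.

```java
public class SupplementaryDemo {
    // Returns how many 16-bit 'char' units are needed for one code point.
    static int charUnits(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        // U+10400 lies outside the 16-bit range U+0000 - U+FFFF.
        int cp = 0x10400;
        String s = new String(Character.toChars(cp));

        // One code point, but two 'char' units (a surrogate pair).
        System.out.println("char units:  " + s.length());
        System.out.println("code points: "
            + s.codePointCount(0, s.length()));
        System.out.println("high surrogate: "
            + Integer.toHexString(s.charAt(0)));
        System.out.println("low surrogate:  "
            + Integer.toHexString(s.charAt(1)));

        // A code point inside the 16-bit range needs a single 'char'.
        System.out.println("U+00E9 char units: " + charUnits(0x00E9));
    }
}
```

Under these assumptions, U+10400 occupies two 'char' units (d801 and dc00), while any code point up to U+FFFF occupies one.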

Supported Character Encodings

JDK uses the java.nio.charset.Charset class to represent a character encoding. The class provides both an encode() method and a decode() method. It also provides a static method, availableCharsets(), that returns all supported encodings. Here is a program to display all the supported character encodings:

/**
 * Encodings.java
 * Copyright (c) 2002 by Dr. Herong Yang
 */
import java.nio.charset.*;
import java.util.*;
class Encodings {
   public static void main(String[] arg) {
      // Map of canonical charset names to Charset objects, sorted by name
      SortedMap m = Charset.availableCharsets();
      Set k = m.keySet();
      System.out.println("Canonical name, Display name,"
         +" Can encode, Aliases");
      Iterator i = k.iterator();
      while (i.hasNext()) {
         String n = (String) i.next();
         Charset e = (Charset) m.get(n);
         String d = e.displayName();
         // true if this charset supports encoding as well as decoding
         boolean c = e.canEncode();
         System.out.print(n+", "+d+", "+c);
         // Append all alias names on the same output line
         Set s = e.aliases();
         Iterator j = s.iterator();
         while (j.hasNext()) {
            String a = (String) j.next();
            System.out.print(", "+a);
         }
         System.out.println("");
      }
   }
}

Output:

Canonical name, Display name, Can encode, Aliases
Big5, Big5, true, csBig5
Big5-HKSCS, Big5-HKSCS, true, big5-hkscs, Big5_HKSCS, big5hkscs
EUC-CN, EUC-CN, true
EUC-JP, EUC-JP, true, eucjis, x-eucjp, csEUCPkdFmtjapanese, eucjp, 
   Extended_UNIX_Code_Packed_Format_for_Japanese, x-euc-jp, euc_jp
euc-jp-linux, euc-jp-linux, true, euc_jp_linux
EUC-KR, EUC-KR, true, ksc5601, 5601, ksc5601_1987, ksc_5601, 
   ksc5601-1987, euc_kr, ks_c_5601-1987, euckr, csEUCKR
EUC-TW, EUC-TW, true, cns11643, euc_tw, euctw
GB18030, GB18030, true, gb18030-2000
GBK, GBK, true, GBK
ISCII91, ISCII91, true, iscii, ST_SEV_358-88, iso-ir-153, 
   csISO153GOST1976874
ISO-2022-CN-CNS, ISO-2022-CN-CNS, true, ISO2022CN_CNS
ISO-2022-CN-GB, ISO-2022-CN-GB, true, ISO2022CN_GB
ISO-2022-KR, ISO-2022-KR, true, ISO2022KR, csISO2022KR
ISO-8859-1, ISO-8859-1, true, iso-ir-100, 8859_1, ISO_8859-1, ISO8859_1,
   819, csISOLatin1, IBM-819, ISO_8859-1:1987, latin1, cp819, ISO8859-1,
   IBM819, ISO_8859_1, l1
ISO-8859-13, ISO-8859-13, true
ISO-8859-15, ISO-8859-15, true, 8859_15, csISOlatin9, IBM923, cp923, 923,
   L9, IBM-923, ISO8859-15, LATIN9, ISO_8859-15, LATIN0, csISOlatin0, 
   ISO8859_15_FDIS, ISO-8859-15
ISO-8859-2, ISO-8859-2, true
ISO-8859-3, ISO-8859-3, true
ISO-8859-4, ISO-8859-4, true
ISO-8859-5, ISO-8859-5, true
ISO-8859-6, ISO-8859-6, true
ISO-8859-7, ISO-8859-7, true
ISO-8859-8, ISO-8859-8, true
ISO-8859-9, ISO-8859-9, true
JIS0201, JIS0201, true, X0201, JIS_X0201, csHalfWidthKatakana
JIS0208, JIS0208, true, JIS_C6626-1983, csISO87JISX0208, x0208, 
   JIS_X0208-1983, iso-ir-87
JIS0212, JIS0212, true, jis_x0212-1990, x0212, iso-ir-159, 
   csISO159JISC02121990
Johab, Johab, true, ms1361, ksc5601_1992, ksc5601-1992
KOI8-R, KOI8-R, true
Shift_JIS, Shift_JIS, true, shift-jis, x-sjis, ms_kanji, shift_jis, 
   csShiftJIS, sjis, pck
TIS-620, TIS-620, true
US-ASCII, US-ASCII, true, IBM367, ISO646-US, ANSI_X3.4-1986, cp367, ASCII,
   iso_646.irv:1983, 646, us, iso-ir-6, csASCII, ANSI_X3.4-1968, 
   ISO_646.irv:1991
UTF-16, UTF-16, true, UTF_16
UTF-16BE, UTF-16BE, true, X-UTF-16BE, UTF_16BE, ISO-10646-UCS-2
UTF-16LE, UTF-16LE, true, UTF_16LE, X-UTF-16LE
UTF-8, UTF-8, true, UTF8
windows-1250, windows-1250, true
windows-1251, windows-1251, true
windows-1252, windows-1252, true, cp1252
windows-1253, windows-1253, true
windows-1254, windows-1254, true
windows-1255, windows-1255, true
windows-1256, windows-1256, true
windows-1257, windows-1257, true
windows-1258, windows-1258, true
windows-936, windows-936, true, ms936, ms_936
windows-949, windows-949, true, ms_949, ms949
windows-950, windows-950, true, ms950
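
Any canonical name or alias from the list above can be passed to Charset.forName() to obtain a Charset object. The sketch below is a minimal illustration of the encode() and decode() methods mentioned earlier: it encodes a string into bytes and decodes the bytes back. The class name CharsetRoundTrip and its helper methods are hypothetical names chosen for this example, not part of the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class CharsetRoundTrip {
    // Encodes the text with the named charset, then decodes the bytes back.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        ByteBuffer bytes = cs.encode(text);   // characters -> bytes
        CharBuffer chars = cs.decode(bytes);  // bytes -> characters
        return chars.toString();
    }

    // Returns the number of bytes the text occupies in the named charset.
    static int encodedLength(String text, String charsetName) {
        return Charset.forName(charsetName).encode(text).remaining();
    }

    public static void main(String[] args) {
        String text = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // Aliases resolve to the same charset as the canonical name.
        System.out.println("latin1 -> "
            + Charset.forName("latin1").name());

        System.out.println("UTF-8 bytes:      "
            + encodedLength(text, "UTF-8"));
        System.out.println("ISO-8859-1 bytes: "
            + encodedLength(text, "ISO-8859-1"));

        // US-ASCII cannot represent U+00E9, so encode() substitutes
        // the charset's replacement byte and the round trip is lossy.
        System.out.println("US-ASCII lossless: "
            + text.equals(roundTrip(text, "US-ASCII")));
        System.out.println("UTF-8 lossless:    "
            + text.equals(roundTrip(text, "UTF-8")));
    }
}
```

Note that Charset.encode() does not throw an exception for characters the charset cannot represent; it silently replaces them, which is why the US-ASCII round trip above does not return the original string.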

(Continued on next part...)

Dr. Herong Yang, updated in 2007