JDK Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 4.32, 2006

Encoding Map Counts


(Continued from previous part...)

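   // Encodes a single char with the named charset.
   // Returns null if the char cannot be encoded or the charset is unusable.
   public static byte[] encodeByEncoder(char c, String cs) {
      Charset cso = null;
      byte[] b = null;
      try {
         cso = Charset.forName(cs);
         CharsetEncoder e = cso.newEncoder();
         e.reset();
         ByteBuffer bb = e.encode(CharBuffer.wrap(new char[] {c}));
         // Copy only the bytes actually produced by the encoder
         if (bb.limit()>0) b = copyBytes(bb.array(),bb.limit());
      } catch (IllegalArgumentException e) {
         // Covers both IllegalCharsetNameException and
         // UnsupportedCharsetException thrown by Charset.forName()
         System.out.println(e.toString());
      } catch (CharacterCodingException e) {
         // invalid character, return null
      }
      return b;
   }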
   public static void printBytes(byte[] b) {
      if (b!=null) {
         for (int j=0; j<b.length; j++)
            System.out.print(" "+byteToHex(b[j]));
      } else {
         System.out.print(" XX");
      }
   }   
   public static byte[] copyBytes(byte[] a, int l) {
      byte[] b = new byte[l];
      for (int i=0; i<Math.min(l,a.length); i++) b[i] = a[i];
      return b;
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}

Note that:

  • CharsetEncoder.encode() is used to encode each code point stored as a single "char" value.
  • Since a Java "char" holds only 16 bits, this program tests only code points in the 0x0000 - 0xFFFF range. For encodings like UTF-8, which can encode code points up to 0x10FFFF, only a subset of the character set is counted. The 2048 surrogate values (0xD800 - 0xDFFF) cannot be encoded on their own and are counted as invalid.
  • The encoding name must be specified as a command-line argument.
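
To illustrate the 16-bit limit described above, here is a minimal sketch that is not part of the original program. The class name SurrogateEncodeSketch and its encode() helper are hypothetical names introduced for this example. It assumes the encoder's default behavior of reporting malformed input, and shows that a code point above 0xFFFF is stored as two "char" values, that one half alone is rejected, and that the pair together encodes to 4 bytes in UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, for illustration only
public class SurrogateEncodeSketch {
   // Encodes the given chars; returns null if the encoder rejects them
   public static byte[] encode(String cs, char[] chars) {
      CharsetEncoder e = Charset.forName(cs).newEncoder();
      try {
         ByteBuffer bb = e.encode(CharBuffer.wrap(chars));
         byte[] b = new byte[bb.remaining()];
         bb.get(b);
         return b;
      } catch (CharacterCodingException ex) {
         return null;
      }
   }
   public static void main(String[] args) {
      // U+1F600 is stored as two chars: a high and a low surrogate
      char[] pair = Character.toChars(0x1F600);
      System.out.println(pair.length);                                   // 2
      // A lone surrogate cannot be encoded
      System.out.println(encode("UTF-8", new char[] {pair[0]}) == null); // true
      // The complete pair encodes to 4 bytes
      System.out.println(encode("UTF-8", pair).length);                  // 4
   }
}
```

This is why a program that feeds one "char" at a time to the encoder can never reach the code points above 0xFFFF.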

If you run this program with US-ASCII as the argument, you will get:

US-ASCII encoding:
0000 > 00 - 007F > 7F = 128
0080 > XX - FFFF > XX = 65408
Total characters = 65536
Valid characters = 128
Invalid characters = 65408

This tells us that the US-ASCII character set has only 128 characters.

If you run this program with ISO-8859-1 (Latin 1) as the argument, you will get:

ISO-8859-1 encoding:
0000 > 00 - 00FF > FF = 256
0100 > XX - FFFF > XX = 65280
Total characters = 65536
Valid characters = 256
Invalid characters = 65280

This tells us that the ISO-8859-1 character set has only 256 characters.
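
The two counts above can be cross-checked with a shorter sketch. This is not the tutorial's program: the class name CanEncodeCountSketch and the method countEncodable() are hypothetical, and it uses CharsetEncoder.canEncode() instead of actually encoding each character, so it reports only the totals, not the byte ranges.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical cross-check, for illustration only
public class CanEncodeCountSketch {
   // Counts how many of the 65536 char values the charset can encode
   public static int countEncodable(String cs) {
      CharsetEncoder e = Charset.forName(cs).newEncoder();
      int valid = 0;
      for (int i = 0; i <= 0xFFFF; i++) {
         if (e.canEncode((char) i)) valid++;
      }
      return valid;
   }
   public static void main(String[] args) {
      String[] names = {"US-ASCII", "ISO-8859-1"};
      for (int i = 0; i < names.length; i++) {
         int valid = countEncodable(names[i]);
         System.out.println(names[i] + ": valid = " + valid
            + ", invalid = " + (65536 - valid));
      }
      // US-ASCII: valid = 128, invalid = 65408
      // ISO-8859-1: valid = 256, invalid = 65280
   }
}
```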

Comparison of Encoding Maps

The following table is based on the output of the EncodingCounter program run with different supported encoding names. It provides a brief comparison of some of the encodings.

Encoding     Map     US-ASCII
Name         Size    Compatible   Notes
US-ASCII     128     Y            7-bit characters only
ISO-8859-1   256     Y            8-bit (single byte) characters
CP1252       251     Y            1 byte, with code points up to 0x2122
UTF-8        63488   Y            1-3 bytes
UTF-16BE     63488   N            2 bytes, identical to the code point value
UTF-16LE     63488   N            2 bytes, code point bytes in reverse order
UTF-16       63488   N            4 bytes: 2-byte byte order mark, then 2 bytes = UTF-16BE
GBK          24068   Y            1-2 bytes, Chinese 1993 standard
GB18030      63488   Y            1-4 bytes, superset of GBK, 2000 standard
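
The byte patterns noted for the Unicode encodings can be seen by encoding one sample character. The sketch below is an illustration only: the class name EncodingBytesSketch and its helpers are hypothetical, and the sample character U+2122 (trade mark sign) is chosen here simply because the table mentions it as the highest code point in CP1252.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, for illustration only
public class EncodingBytesSketch {
   // Formats bytes as space-separated upper-case hex pairs
   public static String hex(byte[] b) {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < b.length; i++) {
         if (i > 0) sb.append(' ');
         sb.append(String.format("%02X", b[i] & 0xff));
      }
      return sb.toString();
   }
   // Encodes the string with the named charset and returns the hex form
   public static String encode(String s, String cs) {
      return hex(s.getBytes(Charset.forName(cs)));
   }
   public static void main(String[] args) {
      String tm = "\u2122";
      System.out.println("UTF-8    " + encode(tm, "UTF-8"));    // E2 84 A2
      System.out.println("UTF-16BE " + encode(tm, "UTF-16BE")); // 21 22
      System.out.println("UTF-16LE " + encode(tm, "UTF-16LE")); // 22 21
      System.out.println("UTF-16   " + encode(tm, "UTF-16"));   // FE FF 21 22
   }
}
```

The first two bytes of the UTF-16 output, FE FF, are the byte order mark that the Java encoder writes before the big-endian data, which accounts for the 4-byte output listed in the table.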


Dr. Herong Yang, updated in 2006