Herong's Tutorial Notes on Unicode - JDK

Herong's Tutorial Notes on Unicode

Dr. Herong Yang, Version 4.02

JDK - Encoding Maps


This chapter helps you understand:

Encoding Map Counter
Comparison of Encoding Maps

Notes and sample code below are based on J2SDK 1.4.1_01.

Encoding Map Analyzer

As mentioned in my other note, "Character Set and Encoding", J2SDK 1.4.1_01 for Windows 2000 provides 48 built-in encodings.
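To see which encodings a particular JVM offers, one option is to ask the runtime directly. The sketch below assumes the java.nio.charset API and Java 5 or later for generics; the class name CharsetLister is made up for illustration. The list it prints may not match the 48 encodings counted in the other note, since the result depends on the JDK version and platform.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class CharsetLister {
   // Returns a map from canonical charset name to Charset object
   // for every charset this JVM supports
   static SortedMap<String, Charset> supported() {
      return Charset.availableCharsets();
   }
   public static void main(String[] a) {
      SortedMap<String, Charset> map = supported();
      System.out.println(map.size()+" encodings available:");
      for (String name : map.keySet())
         System.out.println("   "+name);
   }
}
```

Any name printed by this program can be passed to EncodingAnalyzer as its command-line argument.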

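The following program analyzes a given encoding and prints a map between the code points (from 0x0000 to 0xFFFF) and the encoded byte sequences. It prints one line for each run of consecutive code points whose encoded sequences have the same length and the same or consecutive last bytes: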

/**
 * EncodingAnalyzer.java
 * Copyright (c) 2002 by Dr. Herong Yang
 */
import java.io.*;
class EncodingAnalyzer {
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] a) {
      String charset = null;
      if (a.length>0) charset = a[0];
      if (charset==null) System.out.println("Default encoding:");
      else System.out.println(charset+" encoding:");
      int lastByte = 0;
      int lastLength = 0;
      byte[] startSequence = null;
      char startChar = 0;
      byte[] endSequence = null;
      char endChar = 0;
      boolean isFirstChar = true;
      for (int i=0; i<0x010000; i++) {
         char c = (char) i;
         String s = String.valueOf(c);
         byte[] b = null;
         if (charset==null) {
            b = s.getBytes();
         } else {
            try {
               b = s.getBytes(charset);
            } catch (UnsupportedEncodingException e) {
               // Unknown encoding name: report it and stop here,
               // because there is no map to print
               System.out.println(e.toString());
               return;
            }
         }
         int l = b.length;
         int lb = ((int) b[l-1]) & 0x00FF;
         // Initialize the first range with the first code point
         if (isFirstChar) {
            isFirstChar = false;
            startSequence = b;
            startChar = c;
            lastByte = lb - 1;
            lastLength = l;
         }
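         // Close the current range and start a new one when the length
         // changes, or when the last byte is neither equal to nor one
         // more than the previous last byte
         if (!(l==lastLength && (lb==lastByte+1 || lb==lastByte))) {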
            System.out.print(charToHex(startChar)+" >");
            printBytes(startSequence);
            System.out.print(" - "+charToHex(endChar)+" >");
            printBytes(endSequence);
            System.out.println("");
            startSequence = b;
            startChar = c;
         }
         endSequence = b;
         endChar = c;
         lastLength = l;
         lastByte = lb;
      }
      System.out.print(charToHex(startChar)+" >");
      printBytes(startSequence);
      System.out.print(" - "+charToHex(endChar)+" >");
      printBytes(endSequence);
      System.out.println("");
   }
   public static void printBytes(byte[] b) {
      for (int j=0; j<b.length; j++)
         System.out.print(" "+byteToHex(b[j]));
   }   
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}

Note that:

String.getBytes() is used to encode each code point, stored as a "char" value, into a byte sequence.
Since a Java "char" holds only 16 bits, the program covers only code points in the 0x0000 - 0xFFFF range. For encodings like UTF-8, which can encode code points up to 0x10FFFF, this is only a subset of the character set.
The encoding name should be specified as a command-line argument. If it is omitted, the default encoding is used.

The output of this program will be discussed in the sections below.
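To illustrate the 16-bit limit mentioned in the notes above, here is a minimal sketch of how a code point above 0xFFFF can be encoded on newer JDKs. It is not part of EncodingAnalyzer and does not run on J2SDK 1.4.1_01: it assumes Java 7 or later (Character.toChars() and StandardCharsets). The class name SupplementaryDemo and the sample code point 0x10400 are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;

class SupplementaryDemo {
   // Encodes one code point, which may be above 0xFFFF, as UTF-8
   static byte[] encodeUtf8(int codePoint) {
      // toChars() returns 1 char, or 2 chars (a surrogate pair)
      // for code points above 0xFFFF
      String s = new String(Character.toChars(codePoint));
      return s.getBytes(StandardCharsets.UTF_8);
   }
   // Formats bytes as hex, each preceded by a space
   static String toHex(byte[] b) {
      StringBuilder sb = new StringBuilder();
      for (int j=0; j<b.length; j++)
         sb.append(String.format(" %02X", b[j] & 0xFF));
      return sb.toString();
   }
   public static void main(String[] a) {
      int cp = 0x10400; // sample code point above 0xFFFF
      System.out.println("Number of chars: "
         +Character.toChars(cp).length);
      System.out.println(Integer.toHexString(cp).toUpperCase()
         +" >"+toHex(encodeUtf8(cp)));
   }
}
```

The code point 0x10400 needs two "char" values, so a loop over single "char" values like the one in EncodingAnalyzer never produces its 4-byte UTF-8 sequence.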

US-ASCII

US-ASCII encoding:

Code        Code
Point       Point

0000 > 00 - 007F > 7F
0080 > 3F - FFFF > 3F

This is a very simple map.
For valid code points, the encoded byte sequence is one byte only, taking the low-order byte of the code point.
Valid code points are only in the 0x0000 - 0x007F range. All other code points are encoded as 0x3F, the "?" character.
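The US-ASCII map above can also be checked directly. The sketch below compares each "char" value against the two ranges in the output: the low-order byte for 0x0000 - 0x007F, and 0x3F for everything else. It assumes Java 7 or later for StandardCharsets; the class name AsciiMapCheck and its methods are hypothetical helpers, not part of the original program.

```java
import java.nio.charset.StandardCharsets;

class AsciiMapCheck {
   // Encodes one char as US-ASCII and returns the first
   // encoded byte as a value in the 0 - 255 range
   static int encode(char c) {
      byte[] b = String.valueOf(c).getBytes(StandardCharsets.US_ASCII);
      return b[0] & 0xFF;
   }
   // Counts the chars in 0x0000 - 0xFFFF whose encoded byte
   // differs from the map printed by EncodingAnalyzer
   static int mismatches() {
      int count = 0;
      for (int i=0; i<0x010000; i++) {
         int expected = (i<=0x7F) ? i : 0x3F;
         if (encode((char) i)!=expected) count++;
      }
      return count;
   }
   public static void main(String[] a) {
      System.out.println("Mismatches: "+mismatches());
   }
}
```

A count of zero means the JVM in use produces the same two ranges as the output shown above.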

(Continued on next part...)


Dr. Herong Yang, updated in 2007
