Character.toChars() - "char" Sequence of Code Point

Unicode Tutorials - Herong's Tutorial Examples


This section provides a tutorial example on how to test the 'Character' class toChars() static method, which converts Unicode code points to 'char' sequences. Each resulting 'char' sequence matches the byte sequence of the UTF-16BE encoding of the code point.

One interesting static method offered in the "Character" class is the "toChars(int codePoint)" method, which returns the "char" sequence for any valid Unicode code point. It returns 1 "char" if a BMP character is given, and 2 "char"s (a surrogate pair) if a supplementary character is given. If the given value is not a valid code point, it throws an IllegalArgumentException.
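As a quick illustration of the 1 "char" versus 2 "char"s behavior, here is a minimal sketch. The class name ToCharsSketch and the helper describe() are hypothetical names chosen for this illustration only:

```java
// A minimal sketch; ToCharsSketch and describe() are hypothetical names.
class ToCharsSketch {

    // Returns the "char" sequence of a code point as space-separated HEX.
    static String describe(int codePoint) {
        char[] units = Character.toChars(codePoint);
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x2103));   // BMP character: 1 "char"
        System.out.println(describe(0x1F132));  // supplementary: 2 "char"s
    }
}
```

The first call should print a single 4-digit HEX value, and the second call should print two values, with the high surrogate first.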

Here is a tutorial example on how to use "toChars()" and other related methods:

/* UnicodeCharacterToChars.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class UnicodeCharacterToChars {
   static int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0, 
      0x20FFFF};
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] arg) {
      try {     
         for (int i=0; i<unicodeList.length; i++) {

// Starting with the code point value
            int codePoint  = unicodeList[i];

// Dumping data in HEX numbers
            System.out.print("\n");
            System.out.print("\n                 Code point: "
               +intToHex(codePoint));

// Getting Unicode character basic properties
            System.out.print("\n                isDefined(): "
               +Character.isDefined(codePoint));
            System.out.print("\n                  getName(): "
               +Character.getName(codePoint));
            System.out.print("\n           isBmpCodePoint(): "
               +Character.isBmpCodePoint(codePoint));
            System.out.print("\n isSupplementaryCodePoint(): "
               +Character.isSupplementaryCodePoint(codePoint));
            System.out.print("\n                charCount(): "
               +Character.charCount(codePoint));

// Getting surrogate char pair
            char charHigh = Character.highSurrogate(codePoint);
            char charLow = Character.lowSurrogate(codePoint);
            System.out.print("\n            highSurrogate(): "
               +charToHex(charHigh));
            System.out.print("\n             lowSurrogate(): "
               +charToHex(charLow));
            System.out.print("\n          isSurrogatePair(): "
               +Character.isSurrogatePair(charHigh, charLow));

// Getting char sequence
            char[] charSeq = Character.toChars(codePoint);
            System.out.print("\n                  toChars():");
            for (int j=0; j<charSeq.length; j++)
               System.out.print(" "+charToHex(charSeq[j]));

// Getting UTF-16BE byte sequence
            int[] intArray = {codePoint};
            String charString = new String(intArray, 0, 1);
            byte[] utf16Seq = charString.getBytes("UTF-16BE");
            System.out.print("\n     UTF-16BE byte sequence:");
            for (int j=0; j<utf16Seq.length; j++)
               System.out.print(" "+byteToHex(utf16Seq[j]));
         }
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
   public static String intToHex(int i) {
      char hi = (char) (i >>> 16);
      char lo = (char) (i & 0xffff);
      return charToHex(hi) + charToHex(lo);
   }
}

Compile and run it with Java 11:

C:\herong>javac UnicodeCharacterToChars.java

C:\herong>java UnicodeCharacterToChars

                 Code point: 00000043
                isDefined(): true
                  getName(): LATIN CAPITAL LETTER C
           isBmpCodePoint(): true
 isSupplementaryCodePoint(): false
                charCount(): 1
            highSurrogate(): D7C0
             lowSurrogate(): DC43
          isSurrogatePair(): false
                  toChars(): 0043
     UTF-16BE byte sequence: 00 43

                 Code point: 00002103
                isDefined(): true
                  getName(): DEGREE CELSIUS
           isBmpCodePoint(): true
 isSupplementaryCodePoint(): false
                charCount(): 1
            highSurrogate(): D7C8
             lowSurrogate(): DD03
          isSurrogatePair(): false
                  toChars(): 2103
     UTF-16BE byte sequence: 21 03

                 Code point: 0001F132
                isDefined(): true
                  getName(): SQUARED LATIN CAPITAL LETTER C
           isBmpCodePoint(): false
 isSupplementaryCodePoint(): true
                charCount(): 2
            highSurrogate(): D83C
             lowSurrogate(): DD32
          isSurrogatePair(): true
                  toChars(): D83C DD32
     UTF-16BE byte sequence: D8 3C DD 32

                 Code point: 0001F1A0
                isDefined(): false
                  getName(): null
           isBmpCodePoint(): false
 isSupplementaryCodePoint(): true
                charCount(): 2
            highSurrogate(): D83C
             lowSurrogate(): DDA0
          isSurrogatePair(): true
                  toChars(): D83C DDA0
     UTF-16BE byte sequence: D8 3C DD A0

                 Code point: 0020FFFF
                isDefined(): false
java.lang.IllegalArgumentException

The output confirms that:

isDefined(int codePoint) reports whether the given int value is a code point assigned to a character in the Unicode version supported by the Java runtime. In this run, it returns false both for 0x1F1A0 and for 0x20FFFF, which is outside the Unicode code point range.
An undefined but valid code point does not cause exceptions: for 0x1F1A0, getName() returns null and the other methods still work. The IllegalArgumentException for 0x20FFFF is thrown by getName() because the value is greater than the maximum code point 0x10FFFF. Call Character.isValidCodePoint(int codePoint) first to detect such values.
Java can return the character name for each defined Unicode character.
For BMP characters, highSurrogate(int codePoint) and lowSurrogate(int codePoint) return meaningless values that do not form a surrogate pair, so they should be called for supplementary characters only.
For supplementary characters, highSurrogate(int codePoint) and lowSurrogate(int codePoint) return a valid surrogate "char" pair.
For supplementary characters, toChars(int codePoint) returns the same surrogate "char" pair, with the high surrogate "char" first.
The "char" sequence returned by toChars(int codePoint) matches the byte sequence of the UTF-16BE encoding for both BMP and supplementary characters: each "char" corresponds to 2 bytes, high-order byte first.
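The observations above can be turned into a small guard and a check. This is only a sketch, assuming Character.isValidCodePoint() is used as the range test. The class name CodePointGuardSketch and the helpers safeName() and matchesUtf16BE() are hypothetical and not part of the tutorial program:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration only.
class CodePointGuardSketch {

    // Looks up the character name only after a range check,
    // so getName() never sees an out-of-range value.
    static String safeName(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return "(invalid code point)";   // outside 0x0000..0x10FFFF
        }
        String name = Character.getName(codePoint);
        return name == null ? "(unassigned)" : name;
    }

    // Checks that each "char" from toChars() equals 2 UTF-16BE bytes,
    // high-order byte first. Expects a valid code point; toChars()
    // throws IllegalArgumentException otherwise.
    static boolean matchesUtf16BE(int codePoint) {
        char[] units = Character.toChars(codePoint);
        byte[] bytes = new String(units).getBytes(StandardCharsets.UTF_16BE);
        if (bytes.length != units.length * 2) return false;
        for (int i = 0; i < units.length; i++) {
            int rebuilt = ((bytes[2 * i] & 0xFF) << 8) | (bytes[2 * i + 1] & 0xFF);
            if (rebuilt != units[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(safeName(0x43));
        System.out.println(safeName(0x20FFFF));
        System.out.println(matchesUtf16BE(0x2103));
        System.out.println(matchesUtf16BE(0x1F132));
    }
}
```

With this guard, the out-of-range value 0x20FFFF is reported as invalid instead of raising an exception, while the two comparisons should both report a match.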