String.toCharArray() Returns the UTF-16BE Sequence

Unicode Tutorials - Herong's Tutorial Examples

This section provides a tutorial example showing that the output of toCharArray() is the same as the output of getBytes("UTF-16BE") at the bit level.

Another way to look at a "String" object is to dump it into a "char" sequence or a "byte" sequence with different encoding algorithms:

/* UnicodeStringEncoding.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
import java.io.*;
class UnicodeStringEncoding {
   static int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0};
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] arg) {
      try {     

// Constructing a String from a list of code points
         int num = unicodeList.length;
         String str = new String(unicodeList, 0, num);

// String length and code point count
         System.out.print("\n # of Unicode characters: "+num);
         System.out.print("\n        codePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n                length(): "
            +str.length());

// Getting the char sequence
         char[] charSeq = str.toCharArray();
         System.out.print("\n           toCharArray():");
         printChars(charSeq);

// Getting Unicode encoding sequences
         byte[] byteSeq8 = str.getBytes("UTF-8");
         System.out.print("\n         getBytes(UTF-8):");
         printBytes(byteSeq8);
         byte[] byteSeq16 = str.getBytes("UTF-16BE");
         System.out.print("\n      getBytes(UTF-16BE):");
         printBytes(byteSeq16);
         byte[] byteSeq32 = str.getBytes("UTF-32BE");
         System.out.print("\n      getBytes(UTF-32BE):");
         printBytes(byteSeq32);
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static void printBytes(byte[] b) {
      for (int j=0; j<b.length; j++)
         System.out.print(" "+byteToHex(b[j]));
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static void printChars(char[] c) {
      for (int j=0; j<c.length; j++)
         System.out.print(" "+charToHex(c[j]));
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}

Compile and run it with Java 11:

C:\herong>javac UnicodeStringEncoding.java

C:\herong>java UnicodeStringEncoding
 # of Unicode characters: 4
        codePointCount(): 4
                length(): 6
           toCharArray(): 0043 2103 D83C DD32 D83C DDA0
         getBytes(UTF-8): 43 E2 84 83 F0 9F 84 B2 F0 9F 86 A0
      getBytes(UTF-16BE): 00 43 21 03 D8 3C DD 32 D8 3C DD A0
      getBytes(UTF-32BE): 00 00 00 43 00 00 21 03 00 01 F1 32 00 01...

The output confirms that:

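toCharArray() returns the same output as getBytes("UTF-16BE") at the bit level. In other words, a "String" object exposes its Unicode characters as a UTF-16 encoded "char" sequence, where each supplementary character (such as U+1F132) takes a surrogate pair of two "char" values. This is why length() returns 6 while codePointCount() returns 4.
getBytes("UTF-32BE") returns the same output as the original code point value list at the bit level.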
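
The two observations can also be checked programmatically instead of by comparing hex dumps by eye. The following is a minimal sketch, not part of the original tutorial program: the class name EncodingCheck and its helper methods are hypothetical, and it assumes a well-formed string (no unpaired surrogates) and a JDK that supports the "UTF-32BE" charset, as the Java 11 run above does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, used here for illustration only.
class EncodingCheck {

   // Packs each char into 2 bytes, high byte first.
   // ByteBuffer uses big-endian byte order by default.
   static byte[] charsToBigEndianBytes(char[] chars) {
      ByteBuffer buf = ByteBuffer.allocate(chars.length * 2);
      for (char c : chars) buf.putChar(c);
      return buf.array();
   }

   // Packs each code point into 4 bytes, high byte first.
   static byte[] codePointsToBigEndianBytes(int[] codePoints) {
      ByteBuffer buf = ByteBuffer.allocate(codePoints.length * 4);
      for (int cp : codePoints) buf.putInt(cp);
      return buf.array();
   }

   // True if toCharArray() matches getBytes(UTF-16BE) at the bit level.
   static boolean charsMatchUtf16BE(String s) {
      return Arrays.equals(charsToBigEndianBytes(s.toCharArray()),
                           s.getBytes(StandardCharsets.UTF_16BE));
   }

   // True if the code point list matches getBytes(UTF-32BE) at the bit level.
   static boolean codePointsMatchUtf32BE(String s) {
      return Arrays.equals(codePointsToBigEndianBytes(s.codePoints().toArray()),
                           s.getBytes(Charset.forName("UTF-32BE")));
   }

   public static void main(String[] args) {
      int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0};
      String str = new String(unicodeList, 0, unicodeList.length);
      System.out.println("toCharArray() vs UTF-16BE: " + charsMatchUtf16BE(str));
      System.out.println("code points   vs UTF-32BE: " + codePointsMatchUtf32BE(str));
   }
}
```

Both checks are expected to report a match for the test string used in this section. A string containing an unpaired surrogate would likely fail the first check, because getBytes() substitutes a replacement for malformed input while toCharArray() returns the raw "char" values.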