Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
String.toCharArray() Returns the UTF-16BE Sequence
This section provides a tutorial example showing that the output of toCharArray() is the same as the output of getBytes("UTF-16BE") at the bit level.
Another way to look at a "String" object is to dump it into a "char" sequence or a "byte" sequence with different encoding algorithms:
/* UnicodeStringEncoding.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
import java.io.*;
class UnicodeStringEncoding {
   static int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0};
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] arg) {
      try {

         // Constructing a String from a list of code points
         int num = unicodeList.length;
         String str = new String(unicodeList, 0, num);

         // String length and code point count
         System.out.print("\n # of Unicode characters: "+num);
         System.out.print("\n codePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n length(): "
            +str.length());

         // Getting the char sequence
         char[] charSeq = str.toCharArray();
         System.out.print("\n toCharArray():");
         printChars(charSeq);

         // Getting Unicode encoding sequences
         byte[] byteSeq8 = str.getBytes("UTF-8");
         System.out.print("\n getBytes(UTF-8):");
         printBytes(byteSeq8);

         byte[] byteSeq16 = str.getBytes("UTF-16BE");
         System.out.print("\n getBytes(UTF-16BE):");
         printBytes(byteSeq16);

         byte[] byteSeq32 = str.getBytes("UTF-32BE");
         System.out.print("\n getBytes(UTF-32BE):");
         printBytes(byteSeq32);

      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static void printBytes(byte[] b) {
      for (int j=0; j<b.length; j++)
         System.out.print(" "+byteToHex(b[j]));
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static void printChars(char[] c) {
      for (int j=0; j<c.length; j++)
         System.out.print(" "+charToHex(c[j]));
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}
Compile and run it with Java 11:
C:\herong>javac UnicodeStringEncoding.java

C:\herong>java UnicodeStringEncoding

 # of Unicode characters: 4
 codePointCount(): 4
 length(): 6
 toCharArray(): 0043 2103 D83C DD32 D83C DDA0
 getBytes(UTF-8): 43 E2 84 83 F0 9F 84 B2 F0 9F 86 A0
 getBytes(UTF-16BE): 00 43 21 03 D8 3C DD 32 D8 3C DD A0
 getBytes(UTF-32BE): 00 00 00 43 00 00 21 03 00 01 F1 32 00 01...
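The output confirms that the "char" sequence returned by toCharArray() is identical, at the bit level, to the byte sequence returned by getBytes("UTF-16BE"): each "char" corresponds to 2 bytes, with the high byte first.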
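The same comparison can also be done in code instead of by reading the hex dump. The following is a minimal sketch, not part of the original tutorial program: the class name CharArrayVsUtf16BE and the helper methods charsToBigEndianBytes() and matchesUtf16BE() are hypothetical names introduced for illustration. It packs each "char" into 2 bytes, high byte first, and compares the result with getBytes(StandardCharsets.UTF_16BE).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharArrayVsUtf16BE {

   // Hypothetical helper: packs each char as 2 bytes, high byte first
   // (big-endian order), the same split used by charToHex() above.
   static byte[] charsToBigEndianBytes(char[] chars) {
      byte[] out = new byte[chars.length * 2];
      for (int i = 0; i < chars.length; i++) {
         out[2 * i] = (byte) (chars[i] >>> 8);
         out[2 * i + 1] = (byte) (chars[i] & 0xff);
      }
      return out;
   }

   // Hypothetical helper: true if toCharArray() and getBytes(UTF-16BE)
   // produce the same bytes for the given string.
   static boolean matchesUtf16BE(String str) {
      byte[] fromChars = charsToBigEndianBytes(str.toCharArray());
      byte[] fromEncoder = str.getBytes(StandardCharsets.UTF_16BE);
      return Arrays.equals(fromChars, fromEncoder);
   }

   public static void main(String[] args) {
      // Same code points as the tutorial example
      int[] codePoints = {0x43, 0x2103, 0x1F132, 0x1F1A0};
      String str = new String(codePoints, 0, codePoints.length);
      System.out.println("match: " + matchesUtf16BE(str));
   }
}
```

For the 4 code points used in this section, the program prints "match: true". The check assumes a well-formed string: getBytes() substitutes the charset's replacement bytes for malformed input such as an unpaired surrogate, so the two sequences should not be expected to match in that case.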