Unicode Tutorials - Herong's Tutorial Examples - v5.32, by Herong Yang
Character.toChars() - "char" Sequence of Code Point
This section provides a tutorial example on how to test the 'Character' class toChars() static method, which converts a Unicode code point to a 'char' sequence. Written out with the high byte of each 'char' first, that sequence is identical to the byte sequence from the UTF-16BE encoding of the code point.
One interesting static method offered by the "Character" class is "toChars(int codePoint)", which returns the "char" sequence for any valid Unicode code point. It returns 1 "char" if a BMP character is given, and 2 "char"s (a surrogate pair) if a supplementary character is given. If the given value is not a valid code point, for example a value above 0x10FFFF, it throws an IllegalArgumentException.
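As a minimal sketch of this behavior, the snippet below prints the "char" sequence of one BMP and one supplementary code point, then shows an invalid code point being rejected. The class name ToCharsLengthSketch and the helper hex() are made up for illustration and are not part of the tutorial program.

```java
class ToCharsLengthSketch {

    // Hypothetical helper: formats the char sequence of a code point
    // as space-separated 4-digit hex values, e.g. "D83C DD32".
    static String hex(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex(0x2103));   // BMP code point: one char
        System.out.println(hex(0x1F132));  // supplementary code point: two chars
        try {
            Character.toChars(0x20FFFF);   // above U+10FFFF, not a valid code point
        } catch (IllegalArgumentException e) {
            System.out.println("invalid code point rejected");
        }
    }
}
```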
Here is a tutorial example on how to use "toChars()" and other related methods:
/* UnicodeCharacterToChars.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class UnicodeCharacterToChars {
   static int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0, 0x20FFFF};
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] arg) {
      try {
         for (int i=0; i<unicodeList.length; i++) {

            // Starting with the code point value
            int codePoint = unicodeList[i];

            // Dumping data in HEX numbers
            System.out.print("\n");
            System.out.print("\n Code point: "
               +intToHex(codePoint));

            // Getting Unicode character basic properties
            System.out.print("\n isDefined(): "
               +Character.isDefined(codePoint));
            System.out.print("\n getName(): "
               +Character.getName(codePoint));
            System.out.print("\n isBmpCodePoint(): "
               +Character.isBmpCodePoint(codePoint));
            System.out.print("\n isSupplementaryCodePoint(): "
               +Character.isSupplementaryCodePoint(codePoint));
            System.out.print("\n charCount(): "
               +Character.charCount(codePoint));

            // Getting surrogate char pair
            char charHigh = Character.highSurrogate(codePoint);
            char charLow = Character.lowSurrogate(codePoint);
            System.out.print("\n highSurrogate(): "
               +charToHex(charHigh));
            System.out.print("\n lowSurrogate(): "
               +charToHex(charLow));
            System.out.print("\n isSurrogatePair(): "
               +Character.isSurrogatePair(charHigh, charLow));

            // Getting char sequence
            char[] charSeq = Character.toChars(codePoint);
            System.out.print("\n toChars():");
            for (int j=0; j<charSeq.length; j++)
               System.out.print(" "+charToHex(charSeq[j]));

            // Getting UTF-16BE byte sequence
            int[] intArray = {codePoint};
            String charString = new String(intArray, 0, 1);
            byte[] utf16Seq = charString.getBytes("UTF-16BE");
            System.out.print("\n UTF-16BE byte sequence:");
            for (int j=0; j<utf16Seq.length; j++)
               System.out.print(" "+byteToHex(utf16Seq[j]));
         }
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
   public static String intToHex(int i) {
      char hi = (char) (i >>> 16);
      char lo = (char) (i & 0xffff);
      return charToHex(hi) + charToHex(lo);
   }
}
Compile and run it with Java 11:
C:\herong>javac UnicodeCharacterToChars.java

C:\herong>java UnicodeCharacterToChars

 Code point: 00000043
 isDefined(): true
 getName(): LATIN CAPITAL LETTER C
 isBmpCodePoint(): true
 isSupplementaryCodePoint(): false
 charCount(): 1
 highSurrogate(): D7C0
 lowSurrogate(): DC43
 isSurrogatePair(): false
 toChars(): 0043
 UTF-16BE byte sequence: 00 43

 Code point: 00002103
 isDefined(): true
 getName(): DEGREE CELSIUS
 isBmpCodePoint(): true
 isSupplementaryCodePoint(): false
 charCount(): 1
 highSurrogate(): D7C8
 lowSurrogate(): DD03
 isSurrogatePair(): false
 toChars(): 2103
 UTF-16BE byte sequence: 21 03

 Code point: 0001F132
 isDefined(): true
 getName(): SQUARED LATIN CAPITAL LETTER C
 isBmpCodePoint(): false
 isSupplementaryCodePoint(): true
 charCount(): 2
 highSurrogate(): D83C
 lowSurrogate(): DD32
 isSurrogatePair(): true
 toChars(): D83C DD32
 UTF-16BE byte sequence: D8 3C DD 32

 Code point: 0001F1A0
 isDefined(): false
 getName(): null
 isBmpCodePoint(): false
 isSupplementaryCodePoint(): true
 charCount(): 2
 highSurrogate(): D83C
 lowSurrogate(): DDA0
 isSurrogatePair(): true
 toChars(): D83C DDA0
 UTF-16BE byte sequence: D8 3C DD A0

 Code point: 0020FFFF
 isDefined(): false
java.lang.IllegalArgumentException
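The output confirms that toChars() returns 1 "char" for the BMP code points 0x43 and 0x2103, and 2 "char"s for the supplementary code points 0x1F132 and 0x1F1A0. In every case the "char" sequence from toChars() matches the UTF-16BE byte sequence of the same code point. For supplementary code points, toChars() returns the same pair as highSurrogate() and lowSurrogate(). For BMP code points, those two methods still return values, but the values do not form a surrogate pair. 0x1F1A0 is reported as not defined and has no name, yet it is still a valid code point and converts normally. 0x20FFFF is not a valid code point: isDefined() returns false and the next call, getName(), throws an IllegalArgumentException, which ends the loop.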
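The surrogate values in the output can be reproduced by hand with the standard UTF-16 arithmetic: subtract 0x10000 from the code point, then add the upper 10 bits to 0xD800 and the lower 10 bits to 0xDC00. The sketch below is only an illustration of that arithmetic; the class name SurrogateMathSketch and the methods high() and low() are hypothetical, and the results are compared against the real "Character" methods.

```java
class SurrogateMathSketch {

    // High (leading) surrogate computed by hand. The result is a real
    // surrogate only for supplementary code points (U+10000 to U+10FFFF).
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    // Low (trailing) surrogate computed by hand.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x1F132;
        // Print the hand-computed pair in hex
        System.out.printf("%04X %04X%n", (int) high(cp), (int) low(cp));
        // Compare with the JDK methods used in the tutorial program
        System.out.println(high(cp) == Character.highSurrogate(cp));
        System.out.println(low(cp) == Character.lowSurrogate(cp));
        // Round trip: the pair maps back to the original code point
        System.out.println(Character.toCodePoint(high(cp), low(cp)) == cp);
    }
}
```

Applying the same arithmetic to a BMP code point such as 0x43 gives D7C0 and DC43, the values seen in the output. D7C0 lies below the high surrogate range that starts at D800, which is why isSurrogatePair() returns false there.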