String.length() Is Not Number of Characters

This section provides a tutorial example showing the difference between the length() and codePointCount() methods. The difference between charAt(int index) and codePointAt(int index) is also demonstrated.

Because Unicode characters are stored in "String" objects as a mix of single "char" elements and surrogate "char" element pairs, the "char" element index of a Unicode character is not the same as its character position. The index of a given character depends on how many surrogate pairs come before it in the string.
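To make the "one or two char elements" rule concrete, here is a minimal sketch that uses Character.charCount() to add up the "char" elements needed by a list of code points. The class name CharCountSketch and the helper method charElements() are hypothetical names used for illustration only.

```java
class CharCountSketch {
   // Hypothetical helper: total number of "char" elements needed
   // to store the given code points in a String
   static int charElements(int... codePoints) {
      int total = 0;
      for (int cp : codePoints) {
         // 1 for a BMP code point, 2 for a supplementary code point
         total += Character.charCount(cp);
      }
      return total;
   }

   public static void main(String[] args) {
      System.out.println(Character.charCount(0x2103));   // prints 1
      System.out.println(Character.charCount(0x1F132));  // prints 2

      int[] list = {0x43, 0x2103, 0x1F132};
      String str = new String(list, 0, list.length);
      System.out.println(charElements(list));            // prints 4
      System.out.println(str.length());                  // prints 4
   }
}
```

The total matches length() because a "String" holds exactly one "char" element per BMP code point and two per supplementary code point.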

Here is a tutorial example to show you this problem:

/* UnicodeStringIndex.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
class UnicodeStringIndex {
   static int[] unicodeList = {0x43, 0x2103, 0x1F132, 0x1F1A0, 
      0x37, 0x0667, 0x2166, 0x3286, 0x4E03, 0x1F108};
   public static void main(String[] arg) {
      try {    

// Constructing a String from a list of code points
         int num = unicodeList.length;
         String str = new String(unicodeList, 0, num);

// String length and code point count
         System.out.print("\n # of Unicode characters: "+num);
         System.out.print("\n        codePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n                length(): "
            +str.length());

// String element at a BMP position
         System.out.print("\n               charAt(1): "
            +Integer.toHexString(str.charAt(1)));
         System.out.print("\n          codePointAt(1): "
            +Integer.toHexString(str.codePointAt(1)));

// String element at a high surrogate position
         char high = str.charAt(2);
         System.out.print("\n               charAt(2): "
            +Integer.toHexString(high));
         System.out.print("\n          codePointAt(2): "
            +Integer.toHexString(str.codePointAt(2)));

// String element at a low surrogate position
         char low = str.charAt(3);
         System.out.print("\n               charAt(3): "
            +Integer.toHexString(low));
         System.out.print("\n          codePointAt(3): "
            +Integer.toHexString(str.codePointAt(3)));

// Validating the surrogate char pair
         int code = Character.toCodePoint(high, low);
         System.out.print("\n Character.toCodePoint(): "
            +Integer.toHexString(code));
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
}

Compile and run it with Java 11:

C:\herong>javac UnicodeStringIndex.java

C:\herong>java UnicodeStringIndex
 # of Unicode characters: 10
        codePointCount(): 10
                length(): 13
               charAt(1): 2103
          codePointAt(1): 2103
               charAt(2): d83c
          codePointAt(2): 1f132
               charAt(3): dd32
          codePointAt(3): dd32
 Character.toCodePoint(): 1f132

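The output confirms that:

codePointCount() returns the number of Unicode characters (10), while length() returns the number of "char" elements (13). The 3 supplementary characters, U+1F132, U+1F1A0 and U+1F108, take 2 "char" elements each.

At a BMP position (index 1), charAt() and codePointAt() return the same value, 0x2103.

At a high surrogate position (index 2), charAt() returns only the high surrogate, 0xD83C, while codePointAt() combines it with the low surrogate that follows and returns the full code point, 0x1F132.

At a low surrogate position (index 3), charAt() and codePointAt() both return the low surrogate, 0xDD32, which is not a Unicode character by itself.

Character.toCodePoint() rebuilds the original code point, 0x1F132, from the high and low surrogate pair.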
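Since a plain "char" index can land in the middle of a surrogate pair, one way to read the Unicode character at a given character position is to convert the position to a "char" index first with String.offsetByCodePoints(). The following is a minimal sketch of that idea, not part of the tutorial program above. The class name CodePointIndexSketch and the helper method codePointAtCharacter() are hypothetical.

```java
class CodePointIndexSketch {
   // Hypothetical helper: code point of the n-th Unicode character
   // (0-based) in str. n must be less than the code point count,
   // otherwise codePointAt() throws an exception.
   static int codePointAtCharacter(String str, int n) {
      // Skip n code points from the start to find the "char" index
      int charIndex = str.offsetByCodePoints(0, n);
      return str.codePointAt(charIndex);
   }

   public static void main(String[] args) {
      int[] list = {0x43, 0x2103, 0x1F132, 0x1F1A0, 0x37};
      String str = new String(list, 0, list.length);

      // Character 3 is U+1F1A0. It starts at "char" index 4, not 3,
      // because U+1F132 before it takes 2 "char" elements.
      System.out.println(str.offsetByCodePoints(0, 3));   // prints 4
      System.out.println(
         Integer.toHexString(codePointAtCharacter(str, 3))); // prints 1f1a0
      System.out.println(
         Integer.toHexString(codePointAtCharacter(str, 4))); // prints 37
   }
}
```

Each call walks the string from the beginning, so for processing every character in order, a single pass that advances the index by Character.charCount() of each code point is likely the better fit.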
