String Literals and Source Code Encoding

This section provides a tutorial example on how to represent non-ASCII characters as UTF-8 byte sequences in String literals in Java source code.

In previous tutorials, we learned how to represent non-ASCII characters as \uXXXX escape sequences in String literals in Java source code.

In this tutorial, we will learn how to write non-ASCII characters directly in String literals, so that they are stored as UTF-8 byte sequences in the Java source file.

Here is our test string, which contains 2 non-ASCII characters:

Delicious food U+1F60B takes time U+23F3

Where: 
   U+1F60B: FACE SAVOURING DELICIOUS FOOD
   U+23F3: HOURGLASS WITH FLOWING SAND

Our test string should be displayed like this if you have a font that covers both characters installed on your computer:

Delicious food 😋 takes time ⏳

In our first test program, we will continue to use \uXXXX sequences in our source code. Note that the U+1F60B character lies outside the Basic Multilingual Plane, so it does not fit in a single \uXXXX sequence. It needs to be written as the surrogate pair \uD83D\uDE0B, following the UTF-16 encoding rule.

/* UnicodeStringLiterals.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
class UnicodeStringLiterals {
   public static void main(String[] arg) {
      try {
         String str = "Delicious food \uD83D\uDE0B takes time \u23F3";
         System.out.print("\ncodePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n        length(): "
            +str.length());
         System.out.print("\n     String dump: ");
         printString(str);
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static void printString(String s) {
      char[] chars = s.toCharArray();
      for (char c : chars) {
         byte hi = (byte) (c >>> 8);
         byte lo = (byte) (c & 0xff);
         System.out.print(String.format("%02X%02X ", hi, lo));
      }
   }
}

Compile and run it with Java 11:

C:\herong>javac UnicodeStringLiterals.java

C:\herong>java UnicodeStringLiterals
codePointCount(): 29
        length(): 30
     String dump: 0044 0065 006C 0069 0063 0069 006F 0075 0073 0020 
                  0066 006F 006F 0064 0020 D83D DE0B 0020 0074 0061 
                  006B 0065 0073 0020 0074 0069 006D 0065 0020 23F3
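
The surrogate pair D83D DE0B in the dump above follows from the UTF-16 rule for code points above U+FFFF: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. Here is a minimal sketch of that arithmetic, checked against Character.toChars(). The class name SurrogatePairSketch and the method toSurrogates() are made up for this illustration.

```java
/* SurrogatePairSketch.java
 * Illustration only: class and method names are hypothetical.
 */
class SurrogatePairSketch {
   // Splits a supplementary code point (U+10000 to U+10FFFF)
   // into its UTF-16 high and low surrogates.
   static char[] toSurrogates(int codePoint) {
      if (!Character.isSupplementaryCodePoint(codePoint)) {
         throw new IllegalArgumentException(
            "Not a supplementary code point");
      }
      int offset = codePoint - 0x10000;               // 20-bit value
      char high = (char) (0xD800 + (offset >>> 10));  // top 10 bits
      char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
      return new char[] {high, low};
   }
   public static void main(String[] arg) {
      char[] manual = toSurrogates(0x1F60B);
      char[] library = Character.toChars(0x1F60B);
      System.out.println(String.format("manual:  %04X %04X",
         (int) manual[0], (int) manual[1]));
      System.out.println(String.format("library: %04X %04X",
         (int) library[0], (int) library[1]));
   }
}
```

Both lines should print D83D DE0B, matching the pair used in the String literal.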

In our second test program, we will write the non-ASCII characters directly in the String literal, so they are stored as UTF-8 byte sequences in the source file. This program is easier to read than the first one, because you can actually see the non-ASCII characters displayed in the source code.

/* UnicodeStringLiteralsUTF8.java
 * Copyright (c) 2019 HerongYang.com. All Rights Reserved.
 */
class UnicodeStringLiteralsUTF8 {
   public static void main(String[] arg) {
      try {
         String str = "Delicious food 😋 takes time ⏳";
         System.out.print("\ncodePointCount(): "
            +str.codePointCount(0,str.length()));
         System.out.print("\n        length(): "
            +str.length());
         System.out.print("\n     String dump: ");
         printString(str);
      } catch (Exception e) {
         System.out.print("\n"+e.toString());
      }
   }
   public static void printString(String s) {
      char[] chars = s.toCharArray();
      for (char c : chars) {
         byte hi = (byte) (c >>> 8);
         byte lo = (byte) (c & 0xff);
         System.out.print(String.format("%02X%02X ", hi, lo));
      }
   }
}

This time, we need to make sure that UnicodeStringLiteralsUTF8.java is saved as a UTF-8 encoded file, and compile it with the "-encoding UTF8" option:

C:\herong>javac -encoding UTF8 UnicodeStringLiteralsUTF8.java

C:\herong>java UnicodeStringLiteralsUTF8
codePointCount(): 29
        length(): 30
     String dump: 0044 0065 006C 0069 0063 0069 006F 0075 0073 0020 
                  0066 006F 006F 0064 0020 D83D DE0B 0020 0074 0061 
                  006B 0065 0073 0020 0074 0069 006D 0065 0020 23F3

The output is identical to that of the first program. This confirms that the compiler decoded the UTF-8 byte sequences in the source file into the same String value that the \uXXXX escape sequences produce.
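
To see why the source file encoding matters, the following sketch takes the UTF-8 bytes of the test string and decodes them twice: once as UTF-8 and once as ISO-8859-1. This only mimics what a compiler would see when reading the file with the wrong encoding; it does not invoke javac. ISO-8859-1 is chosen purely as a stand-in for a non-UTF-8 encoding, and the class name SourceEncodingSketch and the method reread() are made up for this illustration.

```java
/* SourceEncodingSketch.java
 * Illustration only: class and method names are hypothetical.
 */
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
class SourceEncodingSketch {
   // Encodes s as UTF-8, then decodes those bytes with the given
   // charset, as a reader assuming that charset would do.
   static String reread(String s, Charset cs) {
      return new String(s.getBytes(StandardCharsets.UTF_8), cs);
   }
   public static void main(String[] arg) {
      // Escape sequences keep this file independent of its own encoding
      String str = "Delicious food \uD83D\uDE0B takes time \u23F3";
      System.out.println("UTF-8 bytes: "
         + str.getBytes(StandardCharsets.UTF_8).length);
      System.out.println("length() read as UTF-8: "
         + reread(str, StandardCharsets.UTF_8).length());
      System.out.println("length() read as ISO-8859-1: "
         + reread(str, StandardCharsets.ISO_8859_1).length());
   }
}
```

The test string occupies 34 bytes in UTF-8: 27 ASCII bytes, 4 bytes for U+1F60B and 3 bytes for U+23F3. Read as UTF-8, it decodes back to 30 "char" values. Read as ISO-8859-1, every byte becomes its own character, so the length grows to 34 and the two non-ASCII characters turn into unrelated symbols.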
