Unicode Tutorials - Herong's Tutorial Examples - v5.31, by Herong Yang
Unicode Versions Supported in Java-History
This section provides a quick summary of different Unicode versions supported in Java history. A major change was introduced in J2SE 5 to support Unicode 4.0 which contains supplementary characters in the range of U+10000 to U+10FFFF.
Unicode has been supported in the Java language since its first release in 1996. Both Unicode and Java have evolved with multiple versions and releases since then. Here is a summary of which version of Unicode is supported in which Java releases.
Java version Release date Unicode version ------------ ------------ --------------- Java 15 March 2020 Unicode 13.0 Java 13 March 2019 Unicode 12.1 Java 11 March 2018 Unicode 10.0 Java 10 September 2018 Unicode 8.0 Java 9 September 2017 Unicode 8.0 Java 8 March 2014 Unicode 6.2 Java SE 7 July 28, 2011 Unicode 6.0 Java SE 6 December 11, 2006 Unicode 4.0 J2SE 5.0 September 30, 2004 Unicode 4.0 J2SE 1.4 February 6, 2002 Unicode 3.0 J2SE 1.3 May 8, 2000 Unicode 2.1 J2SE 1.2 December 8, 1998 Unicode 2.1 JDK 1.1 February 19, 1997 Unicode 2.0 JDK 1.1.7 September 12, 1997 Unicode 2.1 JDK 1.1 February 19, 1997 Unicode 2.0 JDK 1.0 January 23, 1996 Unicode 1.1.5
From JDK 1.0 to JDK 1.4, Java can only support Unicode code points in the range of U+0000 to U+FFFF. Unicode characters in this range are called BMP (Basic Multilingual Plane) characters.
Designers of JDK 1.0 used the following primitive types and built-in class types to support Unicode BMP characters in very natural way:
char - A primitive type with 16-bit storage size, which is perfected designed to store the code point value of a single BMP character.
Character - A built-in class that wraps a value of the primitive type char in an object. The Character class is perfectly designed to represent a single BMP character as an object. The Character class also provides some useful static methods for managing individual BMP characters.
String - A built-in class that wraps an array of primitive type char in an object. The String class is perfectly designed to represent a sequence of BMP characters as an object. The String class also provides some useful static and instance methods for managing a string of BMP characters.
Starting in J2SE 5.0, support of Unicode code points in the range of U+10000 to U+10FFFF was introduced in Java. Unicode characters in this range are called supplementary characters, implemented in Unicode 3.1 in 2001.
Designers of J2SE 5.0 did not introduce any new primitive types and built-in class types to support Unicode supplementary characters. What they did was:
The surrogate char pair of a supplementary character is really the UTF-16BE encoded char sequence of the supplementary character. See other sections in this book for the UTF-16BE encoding algorithm.
Using UTF-16BE encoded char sequences to represent supplementary characters does give one main advantage to the existing class type String: A String object can still be viewed as a char sequence that represent a sequence of any Unicode characters: BMP characters and/or supplementary characters.
If you are interested in why J2SE 5.0 designers did not choose to add a new primitive type like char32 to help support supplementary characters, read the article "Supplementary Characters in the Java Platform" by Norbert Lindenberg and Masayoshi Okutsu at oracle.com/us/technologies/java/supplementary-142654.html.
Table of Contents
ASCII Character Set and Encoding
GB2312 Character Set and Encoding
GB18030 Character Set and Encoding
JIS X0208 Character Set and Encodings
UTF-8 (Unicode Transformation Format - 8-Bit)
UTF-16, UTF-16BE and UTF-16LE Encodings
UTF-32, UTF-32BE and UTF-32LE Encodings
►Java Language and Unicode Characters
►Unicode Versions Supported in Java-History
'int' and 'String' - Basic Data Types for Unicode
"Character" Class with Unicode Utility Methods
Character.toChars() - "char" Sequence of Code Point
Character.getNumericValue() - Numeric Value of Code Point
"String" Class with Unicode Utility Methods
String.length() Is Not Number of Characters
String.toCharArray() Returns the UTF-16BE Sequence
String Literals and Source Code Encoding
Encoding Conversion Programs for Encoded Text Files
Using Notepad as a Unicode Text Editor
Using Microsoft Word as a Unicode Text Editor
Using Microsoft Excel as a Unicode Text Editor
Unicode Code Point Blocks: 0000 - 0FFF
Unicode Code Point Blocks: 1000 - FFFF
Unicode Code Point Blocks: 10000 - 11FFF
Unicode Code Point Blocks: 12000 - 10FFFF