What Is Character Encoding
This section provides a quick introduction to Unicode character encodings and other local language encodings that are supported by Java.
Character Encoding: A mapping scheme between code points of a coded character
set and sequences of bytes.
Coded Character Set: A character set in which each character has an
assigned integral number.
Code Point: An integral number assigned to a character in a coded character set.
As of Unicode 6.1, introduced in January 2012, Unicode code point values range from 0x0000 to 0x10FFFF.
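The idea of a code point can be illustrated with a minimal sketch. The class name CodePointSketch and the helper codePointsOf() are hypothetical names chosen for this illustration; String.codePoints() and Character.MAX_CODE_POINT are standard Java APIs.

```java
class CodePointSketch {
    // Hypothetical helper: returns one int per character (code point) in the text.
    static int[] codePointsOf(String text) {
        return text.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "A", "e" with acute accent, and U+1F600 written as a surrogate pair
        String text = "A\u00E9\uD83D\uDE00";
        for (int cp : codePointsOf(text)) {
            System.out.printf("U+%04X%n", cp);
        }
        // Upper bound of the code point range as known to the running JVM
        System.out.printf("Largest code point: 0x%X%n", Character.MAX_CODE_POINT);
    }
}
```

Note that the third character occupies two char values in the Java string but is reported as a single code point.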
Unicode: A coded character set that contains all characters used
in the written languages of the world and special symbols.
As of Unicode 6.1, introduced in January 2012, the Unicode character set contains
The standard Unicode encoding is called UTF-32BE (Unicode Transformation Format - 32-bit Big Endian),
which maps every Unicode character to a sequence of 4 bytes. For any given Unicode
character, the UTF-32BE encoded byte sequence can be obtained by putting the character's code point
integer number in the 4-byte binary format with the most significant byte listed first.
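The UTF-32BE rule described above can be sketched by hand and compared with the JDK's own encoder. This is only an illustration: the class name Utf32BeSketch and the helper toUtf32Be() are hypothetical, and the sketch assumes the running JVM provides the "UTF-32BE" charset.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class Utf32BeSketch {
    // Hypothetical helper: writes the code point as 4 bytes,
    // most significant byte first.
    static byte[] toUtf32Be(int codePoint) {
        return new byte[] {
            (byte) (codePoint >>> 24),
            (byte) (codePoint >>> 16),
            (byte) (codePoint >>> 8),
            (byte) codePoint
        };
    }

    public static void main(String[] args) {
        int codePoint = 0x20AC; // EURO SIGN
        byte[] manual = toUtf32Be(codePoint);
        // Same character encoded by the JDK (assumes UTF-32BE is available)
        byte[] builtIn = new String(Character.toChars(codePoint))
            .getBytes(Charset.forName("UTF-32BE"));
        System.out.println(Arrays.toString(manual));        // [0, 0, 32, -84]
        System.out.println(Arrays.equals(manual, builtIn)); // true
    }
}
```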
There are also other character encodings used on the Unicode character set, as described previously:
- UTF-32BE - The standard Unicode character encoding as mentioned above.
- UTF-32LE - Same as UTF-32BE, except that the least significant byte is listed first.
- UTF-16BE - Every Unicode character is mapped to a sequence of 2 or 4 bytes with the most significant byte listed first.
- UTF-16LE - Same as UTF-16BE, except that the least significant byte is listed first.
- UTF-8 - Every Unicode character is mapped to a sequence of 1, 2, 3 or 4 bytes.
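The byte counts listed above can be checked with a small sketch. The class name UtfLengthSketch and the helper encodedLength() are hypothetical, and the sketch assumes all five charsets are available in the running JVM.

```java
import java.nio.charset.Charset;

class UtfLengthSketch {
    // Hypothetical helper: number of bytes one code point takes in the named encoding.
    static int encodedLength(int codePoint, String charsetName) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        String[] names = {"UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE"};
        for (int cp : samples) {
            StringBuilder line = new StringBuilder(String.format("U+%04X:", cp));
            for (String name : names) {
                line.append(' ').append(name).append('=').append(encodedLength(cp, name));
            }
            System.out.println(line);
        }
    }
}
```

For these four samples, UTF-8 uses 1, 2, 3 and 4 bytes respectively, UTF-16 uses 2 bytes except for U+1F600 (4 bytes), and UTF-32 always uses 4 bytes.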
Since the Unicode character set is a superset of many local language character sets, many other
character encodings can also be applied to different subsets of the Unicode character set.
Here are some examples of local language character encodings:
- ASCII - The standard encoding for the ASCII character set.
- ISO-8859-1 - The ISO standard encoding for the Latin-1 character set.
- GBK - A standard encoding for the simplified Chinese character set.
- Big5 - A standard encoding for the traditional Chinese character set.
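A local language encoding covers only a subset of Unicode, which the following sketch tries to show. The class name LocalEncodingSketch and the helper canEncode() are hypothetical. US-ASCII and ISO-8859-1 are guaranteed by the Java platform, while GBK is treated here as optional and checked with Charset.isSupported().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LocalEncodingSketch {
    // Hypothetical helper: true if every character of the text
    // can be represented in the given charset.
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // ends with "e" with acute accent
        System.out.println(canEncode(text, StandardCharsets.US_ASCII));   // false
        System.out.println(canEncode(text, StandardCharsets.ISO_8859_1)); // true
        // One byte per character in ISO-8859-1
        System.out.println(text.getBytes(StandardCharsets.ISO_8859_1).length); // 4

        // GBK may not be present in every Java runtime, so test first
        if (Charset.isSupported("GBK")) {
            byte[] gbk = "\u4E2D".getBytes(Charset.forName("GBK"));
            System.out.println("GBK bytes: " + gbk.length);
        }
    }
}
```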
As of Java 11, released in September 2018, the Java language supports the Unicode character set
defined in Unicode 10.0. UTF-32, UTF-16, and UTF-8 encodings are fully supported in Java.
Java can also help you perform local language character encodings.
See the next tutorial for the full list of encodings supported in Java 11.
Java offers the following built-in classes to support the Unicode character set, local language character subsets,
and their encodings:
- java.nio.charset.Charset - Defined in the JDK document as "A named mapping between sequences of sixteen-bit
Unicode code units and sequences of bytes. This class defines methods for creating decoders and encoders and for retrieving
the various names associated with a charset. Instances of this class are immutable."
The "Charset" class represents a particular character encoding defined on a particular character set.
- java.nio.charset.CharsetEncoder - Defined in the JDK document as "An engine that can transform a sequence
of sixteen-bit Unicode characters into a sequence of bytes in a specific charset."
- java.nio.charset.CharsetDecoder - Defined in the JDK document as "An engine that can transform a sequence
of bytes in a specific charset into a sequence of sixteen-bit Unicode characters."
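How the three classes fit together can be seen in a minimal round-trip sketch. The class name CharsetRoundTripSketch and the helper roundTrip() are hypothetical; wrapping the checked CharacterCodingException in an IllegalArgumentException is a choice made for this example only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

class CharsetRoundTripSketch {
    // Hypothetical helper: encodes the text to bytes and decodes it back
    // using an encoder and a decoder obtained from the same Charset.
    static String roundTrip(String text, String charsetName) {
        Charset charset = Charset.forName(charsetName);
        CharsetEncoder encoder = charset.newEncoder();
        CharsetDecoder decoder = charset.newDecoder();
        try {
            ByteBuffer bytes = encoder.encode(CharBuffer.wrap(text));
            CharBuffer chars = decoder.decode(bytes);
            return chars.toString();
        } catch (CharacterCodingException e) {
            // Thrown, for example, when the charset cannot represent a character
            throw new IllegalArgumentException(
                "Cannot round-trip text in " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String text = "Hello \u4E16\u754C";
        System.out.println(text.equals(roundTrip(text, "UTF-8")));    // true
        System.out.println(text.equals(roundTrip(text, "UTF-16BE"))); // true
    }
}
```

Calling roundTrip() with a charset that cannot represent the text, such as "US-ASCII" for the sample above, ends in the IllegalArgumentException branch instead of silently replacing characters.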