JSP and JSTL Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 3.09, 2006

Localization / Internationalization - Non ASCII Characters in JSP Pages


(Continued from previous part...)

To test how to control those factors, I picked two Simplified Chinese characters (说明) and entered them in 7 different formats as a simple HTML paragraph:

<p>
GB2312-binary: ˵Ã÷=(0xCBB5C3F7)<br/>      
GB2312-#xHEX: &#xCBB5;&#xC3F7;<br/>
GB2312-\uHEX: \uCBB5\uC3F7<br/>
Unicode-binary: ‹ô明=(0x8bf4660e)<br/>
Unicode-#xHEX: &#x8bf4;&#x660e;<br/>
Unicode-\uHEX: \u8bf4\u660e<br/>
Unicode-UTF8: 说明=(0xE8AFB4E6988E)<br/>
</p>

Hex numbers are provided next to the binary codes, in case you have trouble copying this file to your local system.

In the next 3 sections, I will put this paragraph into a regular HTML file, a JSP page with standard syntax, and a JSP page with XML syntax, to see how the Tomcat server converts them into Java class files and in what encodings.
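As a cross-check of the hex values listed in the paragraph, here is a minimal Java sketch that encodes the two characters in each encoding and prints the resulting bytes. The class name EncodingBytes and the helper hex() are made up for this illustration, and the sketch assumes the Java runtime includes the GB2312 charset, which standard JDK distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingBytes {
    // Format a byte array as upper-case hex digits, e.g. {0xCB, 0xB5} -> "CBB5".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two test characters, written as Unicode escapes so that this
        // source file stays pure ASCII.
        String text = "\u8bf4\u660e";

        // Assumption: the GB2312 charset is available in this runtime.
        Charset gb2312 = Charset.forName("GB2312");

        System.out.println("GB2312:   " + hex(text.getBytes(gb2312)));                    // CBB5C3F7
        System.out.println("UTF-16BE: " + hex(text.getBytes(StandardCharsets.UTF_16BE))); // 8BF4660E
        System.out.println("UTF-8:    " + hex(text.getBytes(StandardCharsets.UTF_8)));    // E8AFB4E6988E
    }
}
```

The three printed values correspond to the GB2312-binary, Unicode-binary and Unicode-UTF8 lines of the test paragraph.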

Static HTML Text - HTML Page

In the first test, the static text is inserted into a regular HTML file:

<html>
<!-- StaticGB2312.html
     Copyright (c) 2002 by Dr. Herong Yang
-->
<body>
<p>
GB2312-binary: ˵Ã÷=(0xCBB5C3F7)<br/>      
GB2312-#xHEX: &#xCBB5;&#xC3F7;<br/>
GB2312-\uHEX: \uCBB5\uC3F7<br/>
Unicode-binary: ‹ô明=(0x8bf4660e)<br/>
Unicode-#xHEX: &#x8bf4;&#x660e;<br/>
Unicode-\uHEX: \u8bf4\u660e<br/>
Unicode-UTF8: 说明=(0xE8AFB4E6988E)<br/>
</p>
</body>
</html>

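Now view StaticGB2312.html with IE, and try changing the encoding in the View menu. The results match my expectations except in one case:

  • Western European (Windows): Unicode-#xHEX line shows up correctly. I wasn't expecting this, and had no idea why. The likely reason is that numeric character references such as &#x8bf4; always denote Unicode code points, whatever encoding is used to decode the page.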
  • Chinese Simplified (GB2312): GB2312-binary line shows up correctly.
  • Unicode (UTF-8): Unicode-UTF8 line shows up correctly.

Since this is not a JSP, Tomcat will not convert it into a Java class file. I am using this test to validate that the codes are entered correctly.
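What the browser does when you switch encodings can be imitated in plain Java: the same bytes are decoded with different charsets. The sketch below is only an illustration. The class ViewAsEncoding and its view() method are hypothetical names, ISO-8859-1 is used as a stand-in for IE's "Western European" setting, and the GB2312 charset is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ViewAsEncoding {
    // Raw bytes of the GB2312-binary and Unicode-UTF8 samples, taken from the
    // hex values listed in the test paragraph.
    static final byte[] GB2312_BYTES = {
        (byte) 0xCB, (byte) 0xB5, (byte) 0xC3, (byte) 0xF7
    };
    static final byte[] UTF8_BYTES = {
        (byte) 0xE8, (byte) 0xAF, (byte) 0xB4, (byte) 0xE6, (byte) 0x98, (byte) 0x8E
    };

    // Decode raw bytes the way a browser does once an encoding is selected.
    static String view(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset gb2312 = Charset.forName("GB2312"); // assumed to be installed

        // Wrong charset: each byte becomes one accented Latin character.
        System.out.println(view(GB2312_BYTES, StandardCharsets.ISO_8859_1));
        // Matching charsets: both samples decode to the same two characters.
        System.out.println(view(GB2312_BYTES, gb2312));
        System.out.println(view(UTF8_BYTES, StandardCharsets.UTF_8));
    }
}
```

What the console actually shows depends on the console's own encoding, so compare the strings in code if the printed output looks wrong.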

Static HTML Text - JSP Page in Standard Syntax

In the second test, the static text is inserted into a JSP page in standard syntax:

<%@ page contentType="text/html; charset=gb2312" %>
<!-- StaticGB2312.jsp
     Copyright (c) 2002 by Dr. Herong Yang
-->
<html>
<body>
<p>
GB2312-binary: ˵Ã÷=(0xCBB5C3F7)<br/>      
GB2312-#xHEX: &#xCBB5;&#xC3F7;<br/>
GB2312-\uHEX: \uCBB5\uC3F7<br/>
Unicode-binary: ‹ô明=(0x8bf4660e)<br/>
Unicode-#xHEX: &#x8bf4;&#x660e;<br/>
Unicode-\uHEX: \u8bf4\u660e<br/>
Unicode-UTF8: 说明=(0xE8AFB4E6988E)<br/>
</p>
</body>
</html>

If you view this page in IE, you will see that both the GB2312-binary line and the Unicode-#xHEX line are displayed correctly. Here is the explanation:

  • The "charset" value gb2312 in the page directive tells Tomcat to read this JSP file as GB2312-encoded text. So the GB2312-binary line is decoded correctly into Unicode characters.
  • All other binary lines are decoded incorrectly, because their bytes are not GB2312 codes.
  • The Unicode-#xHEX line is left unchanged by decoding, because it contains only ASCII characters. The browser later resolves its numeric character references as Unicode code points.
  • When generating the Java class file, all strings are encoded as UTF-8. This is the default setting of Tomcat. You can change this in the conf/web.xml file.
  • The "charset" value gb2312 also tells Tomcat to use GB2312 for the "out" object and in the Content-Type HTTP header, so the generated HTML document will be in GB2312 encoding.
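The steps in this explanation can be imitated with plain Java string conversions. The sketch below is not Tomcat's code. It is a simplified model under the assumptions stated in the list above: the class name JspEncodingPipeline and the method process() are hypothetical, and the GB2312 charset is assumed to be available in the runtime.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class JspEncodingPipeline {
    // Assumption: the GB2312 charset is installed in this Java runtime.
    static final Charset GB2312 = Charset.forName("GB2312");

    // Simplified model of the three steps described above.
    static byte[] process(byte[] jspFileBytes, Charset pageCharset) {
        // Step 1: read the JSP file bytes using the charset of the page directive.
        String decoded = new String(jspFileBytes, pageCharset);
        // Step 2: store the text as UTF-8 in the generated Java code and read it back.
        byte[] generated = decoded.getBytes(StandardCharsets.UTF_8);
        String compiled = new String(generated, StandardCharsets.UTF_8);
        // Step 3: the "out" object encodes the text with the same charset.
        return compiled.getBytes(pageCharset);
    }

    public static void main(String[] args) {
        byte[] gbSample = {(byte) 0xCB, (byte) 0xB5, (byte) 0xC3, (byte) 0xF7};
        byte[] utf8Sample = {
            (byte) 0xE8, (byte) 0xAF, (byte) 0xB4, (byte) 0xE6, (byte) 0x98, (byte) 0x8E
        };

        // GB2312 bytes pass through unchanged when the page charset is gb2312.
        System.out.println(Arrays.equals(process(gbSample, GB2312), gbSample));     // true
        // UTF-8 bytes are misread as GB2312 and do not come out the same.
        System.out.println(Arrays.equals(process(utf8Sample, GB2312), utf8Sample)); // false
    }
}
```

The second result is false because the byte 0x98 cannot start a GB2312 character, so part of the UTF-8 sample is lost in step 1 and cannot be recovered in step 3.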

To support the above explanation, here are the related lines of the generated Java class file:

   ...      
   response.setContentType("text/html; charset=gb2312");
   ...
   out.write("<p>\r\nGB2312-binary: 说明=(0xCBB5C3F7)");
   out.write("<br/>\r\nGB2312-#xHEX: &#xCBB5;&#xC3F7;");
   out.write("<br/>\r\nGB2312-\\uHEX: \\uCBB5\\uC3F7");
   out.write("<br/>\r\nUnicode-binary: ��明=(0x8bf4660e)");
   out.write("<br/>\r\nUnicode-binary: ----=(0x8bf4660e)");
   out.write("<br/>\r\nUnicode-#xHEX: &#x8bf4;&#x660e;");
   out.write("<br/>\r\nUnicode-\\uHEX: \\u8bf4\\u660e");
   out.write("<br/>\r\nUnicode-UTF8: 璇存��=(0xE8AFB4E6988E)");
   ...
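Note the doubled backslashes in the two \uHEX lines of the generated code. They keep the text as literal ASCII characters, which is why those lines never display as Chinese characters. A minimal sketch of the difference follows. The class name BackslashEscape and its two constants are made up for this illustration.

```java
public class BackslashEscape {
    // Doubled backslashes, as in the generated code: the string holds the
    // twelve ASCII characters of the escape text itself.
    static final String LITERAL = "\\u8bf4\\u660e";

    // Single backslashes: the Java compiler turns each escape into one
    // Unicode character, so the string holds two Chinese characters.
    static final String COMPILED = "\u8bf4\u660e";

    public static void main(String[] args) {
        System.out.println(LITERAL.length());                           // prints 12
        System.out.println(COMPILED.length());                          // prints 2
        System.out.println(Integer.toHexString(COMPILED.codePointAt(0))); // prints 8bf4
    }
}
```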

If you change the "charset" to utf-8, I am sure the Unicode-UTF8 line will be displayed correctly. You know why: Tomcat would then read the file as UTF-8 and send the response as UTF-8.

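By the way, the encoding of the JSP file itself can also be specified with the "pageEncoding" attribute of the "page" directive. When "pageEncoding" is absent, the "charset" value of "contentType" is used both for reading the file and for writing the response.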

(Continued on next part...)

