JSP and JSTL Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 3.09, 2006

Localization / Internationalization - Non ASCII Characters in JSP Pages


(Continued from previous part...)

Static HTML Text - JSP Page in XML Syntax

In the third test, the static text is inserted into a JSP page in XML syntax:

<?xml version="1.0" encoding="gb2312"?>
<jsp:root xmlns:jsp="http://java.sun.com/JSP/Page" 
   xmlns:c="http://java.sun.com/jstl/core" version="1.2"> 
<jsp:directive.page contentType="text/html; charset=gb2312"/>
<!-- StaticGB2312.jsp
     Copyright (c) 2002 by Dr. Herong Yang
-->
<html>
<body>
<p>
GB2312-binary: ˵=(0xCBB5C3F7)<br/>
GB2312-#xHEX: &#xCBB5;&#xC3F7;<br/>
GB2312-\uHEX: \uCBB5\uC3F7<br/>
Unicode-binary: ----=(0x8bf4660e)<br/>
Unicode-#xHEX: &#x8bf4;&#x660e;<br/>
Unicode-\uHEX: \u8bf4\u660e<br/>
Unicode-UTF8: 说明=(0xE8AFB4E6988E)<br/>
</p>
</body>
</html>
</jsp:root>

If you view this page with IE, you should see that only the Unicode-#xHEX line is displayed correctly. This was a big surprise to me:

  • The XML parser in Tomcat is not decoding my JSP page as gb2312, despite the encoding declared in the XML prolog.
  • My JSP page seems to be decoded as ISO-8859-1, apparently the default encoding on my Windows system.
  • The 0x0e byte in the Unicode-binary line caused trouble for the Tomcat server, so I had to remove those binary codes (shown as ---- in the listing).
  • The Java class file is generated in UTF-8 encoding.
  • The "out" object and the Content-Type header are set correctly to GB2312.
  • The XML character references in the #xHEX lines are decoded into binary values. This is different from what happens in JSP pages in standard syntax.
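The first two observations can be reproduced outside Tomcat. The following is only a minimal sketch, not what Tomcat does internally: it decodes the four bytes of the GB2312-binary line once as GB2312 and once as ISO-8859-1. The class and method names are made up for illustration, and the sketch assumes the JDK in use ships the extended "GB2312" charset.

```java
import java.nio.charset.Charset;

class DecodeDemo {
    // The four bytes of the GB2312-binary test line: 0xCBB5 0xC3F7.
    static final byte[] GB2312_BYTES = {
        (byte) 0xCB, (byte) 0xB5, (byte) 0xC3, (byte) 0xF7
    };

    // Decode raw bytes with the named charset (hypothetical helper).
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Render each UTF-16 code unit as U+XXXX so the result is easy to compare.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Decoded as intended: two Chinese characters.
        System.out.println(codePoints(decode(GB2312_BYTES, "GB2312")));
        // Decoded as ISO-8859-1: four unrelated Latin-1 characters.
        System.out.println(codePoints(decode(GB2312_BYTES, "ISO-8859-1")));
    }
}
```

With GB2312 the bytes become U+8BF4 U+660E. With ISO-8859-1 every byte becomes its own character, which matches the four-character garbage seen in the generated Java class.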

Here are the related lines of the generated Java class file:

   ...
   response.setContentType("text/html; charset=gb2312");
   ...
   out.write("<p>");
   out.write("\nGB2312-binary: ËµÃ÷=(0xCBB5C3F7)");
   out.write("<br/>");
   out.write("\nGB2312-#xHEX: ");
   out.write("쮵");
   out.write("쏷");
   out.write("<br/>");
   out.write("\nGB2312-\\uHEX: \\uCBB5\\uC3F7");
   out.write("<br/>");
   out.write("\nUnicode-binary: ----=(0x8bf4660e)");
   out.write("<br/>");
   out.write("\nUnicode-#xHEX: ");
   out.write("说");
   out.write("明");
   out.write("<br/>");
   out.write("\nUnicode-\\uHEX: \\u8bf4\\u660e");
   out.write("<br/>");
   out.write("\nUnicode-UTF8: 说明=(0xE8AFB4E6988E)");
   out.write("<br/>");
   out.write("</p>");
   ....
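The handling of the #xHEX lines is ordinary XML behavior: a numeric character reference always names a Unicode code point, whatever the document encoding is. Here is a small sketch using the standard JAXP parser (not Tomcat's JSP compiler) to show the same effect. The class and helper names are my own assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class CharRefDemo {
    // Parse a small XML fragment and return the text of its root element.
    // Checked parser exceptions are wrapped to keep the sketch short.
    static String parseText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement()
                .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // References written with Unicode code points give the intended text.
        String unicodeRefs = parseText("<p>&#x8bf4;&#x660e;</p>");
        // References written with GB2312 byte values are still read as
        // Unicode code points, which here are Hangul syllables.
        String gb2312Refs = parseText("<p>&#xCBB5;&#xC3F7;</p>");
        System.out.println(unicodeRefs.equals("\u8bf4\u660e"));
        System.out.println(gb2312Refs.equals("\uCBB5\uC3F7"));
    }
}
```

This is why the GB2312-#xHEX line can never display the intended characters: 0xCBB5 and 0xC3F7 are taken as code points U+CBB5 and U+C3F7, not as GB2312 byte pairs.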

I tried changing the charset to UTF-8, but it did not work. In my tests, JSP pages in XML syntax were always decoded as ISO-8859-1. Maybe there is a setting somewhere to control this, but I don't know of one.

Supporting Characters from Multiple Languages

If you are planning to write a page that has characters from multiple language encodings, you have to use Unicode codes and UTF-8 as the HTML document encoding. Here is an example with characters from two encodings: GB2312 and Big5.

<?xml version="1.0"?>
<jsp:root xmlns:jsp="http://java.sun.com/JSP/Page" 
   xmlns:c="http://java.sun.com/jstl/core" version="1.2"> 
<!-- HelpUnicodeUTF8.jsp
     Copyright (c) 2004 by Dr. Herong Yang
-->
<jsp:scriptlet><![CDATA[
   // Send the response as UTF-8 so both character sets can be represented.
   response.setContentType("text/html; charset=utf-8");
   out.println("<html>");
   out.println("<head>");
   out.println("<meta http-equiv=\"Content-Type\""
      + " content=\"text/html; charset=utf-8\"/>");
   out.println("</head>");
   out.println("<body>");
   // Simplified Chinese text (characters found in GB2312).
   out.println("<b>\u8bf4\u660e</b><br/>");
   out.println("<p>\u8fd9\u662f\u4e00\u4efd\u975e\u5e38\u7b80\u5355"
      + "\u7684\u8bf4\u660e\u4e66\u2026</p>");
   // Traditional Chinese text (characters found in Big5).
   out.println("<b>\u8aaa\u660e</b><br/>");
   out.println("<p>\u9019\u662f\u4e00\u4efd\u975e\u5e38\u7c21\u55ae"
      + "\u7684\u8aaa\u660e\u66f8\u2026</p>");
   out.println("</body>");
   out.println("</html>");
]]></jsp:scriptlet>
</jsp:root>

If you view this page with IE, you should see the same message appear twice: once in simplified Chinese and once in traditional Chinese.
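To see why a single local encoding is not enough for this page, one can ask each charset whether it is able to encode the two headings. This is a rough sketch with hypothetical names. It assumes the JDK provides the extended "GB2312" and "Big5" charsets.

```java
import java.nio.charset.Charset;

class CanEncodeDemo {
    // The two headings from the example page.
    static final String SIMPLIFIED = "\u8bf4\u660e";
    static final String TRADITIONAL = "\u8aaa\u660e";

    // True if every character of the text has a mapping in the named charset.
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"GB2312", "Big5", "UTF-8"}) {
            System.out.println(name
                + ": simplified=" + canEncode(name, SIMPLIFIED)
                + ", traditional=" + canEncode(name, TRADITIONAL));
        }
    }
}
```

I would expect GB2312 to reject the traditional heading and Big5 to reject the simplified one, while UTF-8 accepts both. That is the reason the example page uses UTF-8 for the response.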

Conclusion

As you can see from my notes in the previous sections, localizing or internationalizing JSP pages is not an easy task. My recommendations are:
  • Avoid using static text. Put the entire page in a scriptlet, so that all text messages are generated by Java statements.
  • Use Unicode characters in UTF-8 format or \uHEX format for string literals. This lets a single encoding support characters from all local languages.
  • Use UTF-8 as the HTML document encoding instead of the encoding of a particular local language, like GB2312. This may cause problems for users on localized systems where Unicode fonts are not supported, but more and more such systems support Unicode and UTF-8.
  • I still don't know how to control the source code encoding of JSP pages in XML syntax.
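If you follow the \uHEX recommendation, typing the escapes by hand gets tedious. Here is a small sketch of a helper that converts text into \uHEX form, so the result can be pasted into a scriptlet. The class and method names are made up for illustration, and the helper works on UTF-16 code units, so a character outside the Basic Multilingual Plane comes out as two escapes.

```java
class UnicodeEscapeDemo {
    // Replace every non-ASCII UTF-16 code unit with a backslash-u escape
    // of four lowercase hex digits. ASCII characters are copied unchanged.
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Prints the heading of the example page in escaped form.
        System.out.println(toUnicodeEscapes("<b>\u8bf4\u660e</b>"));
    }
}
```

The escaped output contains only ASCII characters, so it does not depend on how the JSP source file itself is decoded.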


Dr. Herong Yang, updated in 2006