Localization / Internationalization - Non ASCII Characters in JSP Pages
(Continued from previous part...)
In order to test out how to control those factors, I picked two simplified
Chinese characters, and entered them in 7 different formats as a simple HTML paragraph:
<p>
GB2312-binary: ˵Ã÷=(0xCBB5C3F7)<br/>
GB2312-#xHEX: &#xCBB5;&#xC3F7;<br/>
GB2312-\uHEX: \uCBB5\uC3F7<br/>
Unicode-binary: ‹ô明=(0x8bf4660e)<br/>
Unicode-#xHEX: &#x8bf4;&#x660e;<br/>
Unicode-\uHEX: \u8bf4\u660e<br/>
Unicode-UTF8: 说明=(0xE8AFB4E6988E)<br/>
</p>
Hex numbers are provided next to the binary codes, in case you have
trouble copying this file to your local system.
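The hex values quoted in the paragraph can be cross-checked with a few lines of Java. This is only a minimal sketch: the class name HexOfTwoCharacters and the helper hex() are made up for illustration, and it assumes the JDK in use ships the extended "GB2312" charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class HexOfTwoCharacters {
    // Returns the bytes of the text in the given charset as uppercase hex digits.
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The two test characters, written as Unicode escapes.
        String text = "\u8bf4\u660e";
        // Assumption: the "GB2312" charset is available in this JDK.
        System.out.println("GB2312:   0x" + hex(text, Charset.forName("GB2312")));
        System.out.println("UTF-16BE: 0x" + hex(text, StandardCharsets.UTF_16BE));
        System.out.println("UTF-8:    0x" + hex(text, StandardCharsets.UTF_8));
    }
}
```

If the three values agree with the numbers in parentheses above, the test paragraph was copied without damage.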
In the next 3 sections, I will put this paragraph into a regular HTML file, a JSP page with
standard syntax, and a JSP page with XML syntax to see how the Tomcat server converts them
into Java class files, and in which encodings.
Static HTML Text - HTML Page
In the first test, the static text is inserted into a regular HTML file:
<html>
<!-- StaticGB2312.html
Copyright (c) 2002 by Dr. Herong Yang
-->
<body>
<p>
GB2312-binary: ˵Ã÷=(0xCBB5C3F7)<br/>
GB2312-#xHEX: &#xCBB5;&#xC3F7;<br/>
GB2312-\uHEX: \uCBB5\uC3F7<br/>
Unicode-binary: ‹ô明=(0x8bf4660e)<br/>
Unicode-#xHEX: &#x8bf4;&#x660e;<br/>
Unicode-\uHEX: \u8bf4\u660e<br/>
Unicode-UTF8: 说明=(0xE8AFB4E6988E)<br/>
</p>
</body>
</html>
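Now view StaticGB2312.html in IE, and try changing the encoding in the View menu.
The results match my expectations, with one exception:
- Western European (Windows): Unicode-#xHEX line shows up correctly. I wasn't
expecting this. The likely reason is that numeric character references name Unicode
code points directly, so the browser resolves them regardless of the selected encoding.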
- Chinese Simplified (GB2312): GB2312-binary line shows up correctly.
- Unicode (UTF-8): Unicode-UTF8 line shows up correctly.
Since this is not a JSP, Tomcat will not convert it into a Java class file.
I am using this test to validate that the codes are entered correctly.
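What the browser does when you switch encodings can be imitated in Java by decoding the same raw bytes under different charsets. The sketch below is illustrative only: ViewEncodingDemo, view() and resolveReference() are hypothetical names, the byte arrays are taken from the hex numbers in the test paragraph, and the "GB2312" and "windows-1252" charsets are assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

class ViewEncodingDemo {
    // Decodes raw bytes the way a browser would under the selected encoding.
    static String view(byte[] raw, String encoding) {
        return new String(raw, Charset.forName(encoding));
    }

    // Resolves a hexadecimal numeric character reference such as "&#x8bf4;".
    // The result does not depend on any byte encoding.
    static String resolveReference(String ref) {
        int codePoint = Integer.parseInt(ref.substring(3, ref.length() - 1), 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        byte[] gb2312Bytes = {(byte) 0xCB, (byte) 0xB5, (byte) 0xC3, (byte) 0xF7};
        byte[] utf8Bytes = {(byte) 0xE8, (byte) 0xAF, (byte) 0xB4,
                            (byte) 0xE6, (byte) 0x98, (byte) 0x8E};
        for (String enc : new String[] {"windows-1252", "GB2312", "UTF-8"}) {
            System.out.println(enc + ": GB2312-binary -> " + view(gb2312Bytes, enc)
                + ", Unicode-UTF8 -> " + view(utf8Bytes, enc));
        }
        // Numeric references resolve to the same characters in every case.
        System.out.println(resolveReference("&#x8bf4;") + resolveReference("&#x660e;"));
    }
}
```

Each binary line comes out right only under its own encoding, while the numeric references always resolve to the same two characters.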
Static HTML Text - JSP Page in Standard Syntax
In the second test, the static text is inserted into a JSP page in standard syntax:
<%@ page contentType="text/html; charset=gb2312" %>
<!-- StaticGB2312.jsp
Copyright (c) 2002 by Dr. Herong Yang
-->
<html>
<body>
<p>
GB2312-binary: ˵Ã÷=(0xCBB5C3F7)<br/>
GB2312-#xHEX: &#xCBB5;&#xC3F7;<br/>
GB2312-\uHEX: \uCBB5\uC3F7<br/>
Unicode-binary: ‹ô明=(0x8bf4660e)<br/>
Unicode-#xHEX: &#x8bf4;&#x660e;<br/>
Unicode-\uHEX: \u8bf4\u660e<br/>
Unicode-UTF8: 说明=(0xE8AFB4E6988E)<br/>
</p>
</body>
</html>
If you view this page in IE, you will see that both the GB2312-binary line and the
Unicode-#xHEX line are displayed correctly. Here is the explanation:
- The "charset" value gb2312 in the page directive tells Tomcat to read this JSP file
as GB2312-encoded text, so the GB2312-binary line is decoded correctly into Unicode characters.
- All other binary lines are decoded incorrectly, because their bytes are not GB2312 codes.
- The Unicode-#xHEX line passes through unchanged, because it consists of plain ASCII characters.
- When Tomcat generates the Java source file, all strings are written in UTF-8. This is
Tomcat's default setting, which you can change in the conf/web.xml file.
- The "charset" value gb2312 also tells Tomcat to use GB2312 for the "out" object and in
the Content-Type HTTP header, so the generated HTML document will be in GB2312 encoding.
To support the explanation above, here are the relevant lines of the generated Java file:
...
response.setContentType("text/html; charset=gb2312");
...
out.write("<p>\r\nGB2312-binary: 说明=(0xCBB5C3F7)");
out.write("<br/>\r\nGB2312-#xHEX: &#xCBB5;&#xC3F7;");
out.write("<br/>\r\nGB2312-\\uHEX: \\uCBB5\\uC3F7");
out.write("<br/>\r\nUnicode-binary: ��明=(0x8bf4660e)");
out.write("<br/>\r\nUnicode-binary: ----=(0x8bf4660e)");
out.write("<br/>\r\nUnicode-#xHEX: &#x8bf4;&#x660e;");
out.write("<br/>\r\nUnicode-\\uHEX: \\u8bf4\\u660e");
out.write("<br/>\r\nUnicode-UTF8: ç’‡å˜ï¿½ï¿½=(0xE8AFB4E6988E)");
...
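If you change the "charset" to utf-8, the Unicode-UTF8 line should be displayed correctly
instead, because Tomcat will then read the file as UTF-8 and write the response as UTF-8.
By the way, the encoding of the JSP file itself can also be specified with the "pageEncoding"
attribute of the "page" directive. When "pageEncoding" is absent, Tomcat falls back to the
"charset" value of "contentType", as it does in this test.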
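The read-then-write behavior described in this section can be modeled in plain Java. This is a conceptual sketch, not Tomcat's actual code: JspEncodingPipeline, render() and hex() are hypothetical names, the two parameters merely stand in for the page encoding and the response charset, and the "GB2312" charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

class JspEncodingPipeline {
    // Hypothetical model of the two steps: decode the page bytes with the page
    // encoding, then encode the resulting string with the response charset.
    static byte[] render(byte[] pageBytes, String pageEncoding, String responseCharset) {
        String text = new String(pageBytes, Charset.forName(pageEncoding));
        return text.getBytes(Charset.forName(responseCharset));
    }

    // Formats bytes as uppercase hex digits for easy comparison.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] gb2312Bytes = {(byte) 0xCB, (byte) 0xB5, (byte) 0xC3, (byte) 0xF7};
        byte[] utf8Bytes = {(byte) 0xE8, (byte) 0xAF, (byte) 0xB4,
                            (byte) 0xE6, (byte) 0x98, (byte) 0x8E};
        // charset=gb2312: the GB2312-binary line survives the round trip.
        System.out.println(hex(render(gb2312Bytes, "GB2312", "GB2312")));
        // charset=utf-8: the Unicode-UTF8 line survives instead.
        System.out.println(hex(render(utf8Bytes, "UTF-8", "UTF-8")));
        // Page read as GB2312, response written as UTF-8: the text is transcoded.
        System.out.println(hex(render(gb2312Bytes, "GB2312", "UTF-8")));
        // Mismatch: UTF-8 bytes read as GB2312 come out damaged.
        System.out.println(hex(render(utf8Bytes, "GB2312", "GB2312")));
    }
}
```

The last call shows why the Unicode-UTF8 line is garbled in the generated file when the page declares gb2312: the bytes are valid, but they are read with the wrong charset.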
(Continued on next part...)