Managing Non ASCII Character Strings
Part:
1
2
3
4
5
(Continued from previous part...)
I tested this script with IE, and entered the following strings:
English ASCII: Hello world!
Spanish UTF-8: ¡Hola mundo!
Korean UTF-8: 여보세요 세계 !
Chinese UTF-8: 你好世界!
Chinese GB2312: 世界你好!
The returned page displayed the input strings correctly, but its HTML source was more interesting:
<html><meta http-equiv="Content-Type" content="text/html;
charset=gb2312"/><body>
<form action=MbStringHttp.php method=get>
English ASCII: <input name=English value='Hello world!' size=16><br>
Spanish UTF-8: <input name=Spanish value='&iexcl;Hola mundo!' size=16><br>
Korean UTF-8: <input name=Korean value='&#50668;&#48372;&#49464;&#50836; &#49464;&#44228; !' size=16><br>
Chinese UTF-8: <input name=ChineseUtf8 value='你好世界!' size=16><br>
Chinese GB2312: <input name=ChineseGb2312 value='&Ecirc;&Agrave;&frac12;&ccedil;&Auml;&atilde;&ordm;&Atilde;&pound;&iexcl;' size=16><br>
<input type=submit name=submit value=Submit>
</form>
<hr><pre>Hello world!
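&iexcl;Hola mundo!
&#50668;&#48372;&#49464;&#50836; &#49464;&#44228; !
你好世界!
&Ecirc;&Agrave;&frac12;&ccedil;&Auml;&atilde;&ordm;&Atilde;&pound;&iexcl;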
</pre></body></html>
- When the Web page has "charset=gb2312", some Unicode characters are recorded as HTML named entities,
like "&iexcl;" and "&Ecirc;". Other Unicode characters are recorded as HTML numeric entities,
like "&#50668;" and "&#48372;".
- When Chinese characters in UTF-8 encoding are copied into the form, they are recorded as GB2312 encoding.
- When Chinese characters in GB2312 encoding are copied into the form, they are recorded as HTML named
entities. I don't know why.
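The percent-encoded values in a query string show which of these forms the browser chose for each field. The following is a minimal sketch, assuming PHP, that decodes two values taken from the query string dump on this page. The helper name describeQueryValue is hypothetical and is not part of the tutorial's script.

```php
<?php
// Hypothetical helper: percent-decode one query-string value and report
// whether the browser sent HTML entity text or raw non-ASCII bytes.
function describeQueryValue(string $encoded): array
{
    $decoded = urldecode($encoded);   // '+' becomes a space, %XX becomes one byte
    return [
        'decoded'       => $decoded,
        'hex'           => bin2hex($decoded),
        // Entity text is plain ASCII, such as "&iexcl;" or "&#50668;".
        'has_entities'  => preg_match('/&(#\d+|[A-Za-z][A-Za-z0-9]*);/', $decoded) === 1,
        // Any byte >= 0x80 means the browser sent encoded characters directly.
        'has_raw_bytes' => preg_match('/[\x80-\xFF]/', $decoded) === 1,
    ];
}

$spanish = describeQueryValue('%26iexcl%3BHola+mundo%21');
$chinese = describeQueryValue('%C4%E3%BA%C3%CA%C0%BD%E7%21');

echo $spanish['decoded'], "\n";   // &iexcl;Hola mundo!
echo $chinese['hex'], "\n";       // c4e3bac3cac0bde721
```

The Spanish value arrives as ASCII entity text, while the Chinese value arrives as eight GB2312 bytes followed by an ASCII "!".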
I looked at the dump file, \temp\MbStringHttp.txt:
--- Query String ---
English=Hello+world%21&
Spanish=%26iexcl%3BHola+mundo%21&
Korean=%26%2350668%3B%26%2348372%3B%26%2349464%3B%26%2350836%3B+
%26%2349464%3B%26%2344228%3B+%21&
ChineseUtf8=%C4%E3%BA%C3%CA%C0%BD%E7%21&
ChineseGb2312=%26Ecirc%3B%26Agrave%3B%26frac12%3B%26ccedil%3B
%26Auml%3B%26atilde%3B%26ordm%3B%26Atilde%3B%26pound%3B%26iexcl%3B&
submit=Submit
--- Raw reqeust input ---
English = (Hello world!)
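Spanish = (&iexcl;Hola mundo!)
Korean = (&#50668;&#48372;&#49464;&#50836; &#49464;&#44228; !)
ChineseUtf8 = (你好世界!)
ChineseGb2312 = (&Ecirc;&Agrave;&frac12;&ccedil;&Auml;&atilde;&ordm;&Atilde;&pound;&iexcl;)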
submit = (Submit)
--- Converted reqeust input ---
English = (Hello world!)
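Spanish = (&iexcl;Hola mundo!)
Korean = (&#50668;&#48372;&#49464;&#50836; &#49464;&#44228; !)
ChineseUtf8 = (浣犲ソ涓栫晫!)
ChineseGb2312 = (&Ecirc;&Agrave;&frac12;&ccedil;&Auml;&atilde;&ordm;&Atilde;&pound;&iexcl;)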
submit = (Submit)
My script handles HTTP input and output encoding correctly when the Web browser submits the input strings
in GB2312. Characters that the browser submits as HTML entities are not converted by the script. To avoid them,
you need to tell your users to enter data in the encoding that the page expects.
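Both cases can also be handled in code. The sketch below assumes PHP with the mbstring extension and a page served as GB2312. The function name toUtf8 is hypothetical and is not the script from this tutorial. It converts GB2312 bytes to UTF-8 and, as an optional fallback, decodes entity text with html_entity_decode(). Whether to decode entities on the server is a judgment call, because a user may have typed "&iexcl;" literally.

```php
<?php
// Hypothetical helper, not the tutorial's script: normalize one form value
// submitted from a GB2312 page to UTF-8.
function toUtf8(string $raw, bool $decodeEntities = false): string
{
    // mbstring names the byte encoding of GB2312 "EUC-CN".
    // ASCII text, including entity text, passes through unchanged.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', 'EUC-CN');
    if ($decodeEntities) {
        // Turn "&iexcl;" or "&#50668;" into the UTF-8 characters they name.
        $utf8 = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML401, 'UTF-8');
    }
    return $utf8;
}

echo toUtf8("\xC4\xE3\xBA\xC3\xCA\xC0\xBD\xE7!"), "\n";           // 你好世界!
echo toUtf8('&iexcl;Hola mundo!', true), "\n";                     // ¡Hola mundo!
echo toUtf8('&#50668;&#48372;&#49464;&#50836;', true), "\n";       // 여보세요
```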
Conclusion
- "mbstring" is a very useful extension for handling multi-byte character strings.
- "mbstring" is also useful to convert strings to different encodings.
- The best approach is to use UTF-8 both for your Web pages and as the internal encoding of your script.
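As a rough illustration of the last point, the sketch below assumes PHP with mbstring and sets UTF-8 in three places: the script's internal encoding, the HTTP Content-Type header, and the form's accept-charset attribute. The function name renderUtf8Form and the field name Greeting are made up for this example.

```php
<?php
mb_internal_encoding('UTF-8');   // mb_* functions now assume UTF-8 strings

// Hypothetical helper: build a small UTF-8 page with one text field.
function renderUtf8Form(string $value): string
{
    // Escape user input so it cannot break out of the attribute.
    $safe = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
    return "<html><head><meta http-equiv=\"Content-Type\" "
         . "content=\"text/html; charset=UTF-8\"/></head><body>\n"
         . "<form method=\"get\" accept-charset=\"UTF-8\">\n"
         . "<input name=\"Greeting\" value=\"$safe\" size=\"16\">\n"
         . "<input type=\"submit\" value=\"Submit\">\n"
         . "</form></body></html>\n";
}

// Send the charset in the HTTP header as well, if headers can still be sent.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=UTF-8');
}
$page = renderUtf8Form('你好世界!');
echo $page;
```

With the page and the form both declared as UTF-8, the browser should be able to submit Spanish, Korean, and Chinese text directly as UTF-8 bytes, so the entity fallback seen in the GB2312 test is not expected to occur.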