Receiving Non ASCII Characters from Input Forms
Receiving Non ASCII Characters in UTF-8
The previous scripts used "charset=iso-8859-1" for the input page. Now let's try
"charset=utf-8". Here is my sample script:
<?php # InputUtf8Get.php
# Copyright (c) 2005 by Dr. Herong Yang, http://www.herongyang.com/
#
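#- Copying CGI values to local variables
# (import_request_variables() was removed in PHP 5.4; read $_REQUEST instead)
foreach (array('English', 'Spanish', 'Korean', 'ChineseUtf8',
'ChineseGb2312') as $name) {
$value = isset($_REQUEST[$name]) ? $_REQUEST[$name] : '';
# Escape quotes and angle brackets so the value is safe inside value='...'
${'r_' . $name} = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}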
#- Generating HTML document
print("<html>");
print('<meta http-equiv="Content-Type"'
.' content="text/html; charset=utf-8"/>');
print("<body>\n");
print("<form action=InputUtf8Get.php method=get>");
print("English ASCII: <input name=English"
." value='$r_English' size=16><br>\n");
print("Spanish UTF-8: <input name=Spanish"
." value='$r_Spanish' size=16><br>\n");
print("Korean UTF-8: <input name=Korean"
." value='$r_Korean' size=16><br>\n");
print("Chinese UTF-8: <input name=ChineseUtf8"
." value='$r_ChineseUtf8' size=16><br>\n");
print("Chinese GB2312: <input name=ChineseGb2312"
." value='$r_ChineseGb2312' size=16><br>\n");
print("<input type=submit name=submit value=Submit>\n");
print("</form>\n");
#- Outputting input strings back to HTML document
print("<hr>");
print("<pre>");
foreach ($_GET as $k => $v) {
print "$k = ($v)\n";
}
print("</pre>");
print("</body>");
print("</html>");
#- Dumping input strings to a file
$file = fopen("\\temp\\InputUtf8Get.txt", 'ab');
$str = "------\n";
fwrite($file, $str, strlen($str));
if (array_key_exists('QUERY_STRING',$_SERVER)) {
$str = $_SERVER['QUERY_STRING'];
} else {
$str = "";
}
fwrite($file, $str, strlen($str));
$str = "------\n";
fwrite($file, $str, strlen($str));
foreach ($_REQUEST as $k => $v) {
$str = "$k = ($v)\n";
fwrite($file, $str, strlen($str));
}
fclose($file);
?>
If you enter the same input strings as in the previous tests:
English ASCII: Hello world!
Spanish UTF-8: ¡Hola mundo!
Korean UTF-8: 여보세요 세계 !
Chinese UTF-8: 你好世界!
Chinese GB2312: 世界你好!
The page is returned with the input strings displayed below the form, and they look correct to me.
If you open the dump file, \temp\InputUtf8Get.txt, you will see how the input strings are URL encoded in
the query string, and decoded in $_REQUEST.
------
English=Hello+world%21&
Spanish=%C2%A1Hola+mundo%21&
Korean=%EC%97%AC%EB%B3%B4%EC%84%B8%EC%9A%94+%EC%84%B8%EA%B3%84+%21&
ChineseUtf8=%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C%21&
ChineseGb2312=%C3%8A%C3%80%C2%BD%C3%A7%C3%84%C3%A3%C2%BA%C3%83%C2%A3%C2%A1
&submit=Submit------
------
English = (Hello world!)
Spanish = (¡Hola mundo!)
Korean = (여보세요 세계 !)
ChineseUtf8 = (你好世界!)
ChineseGb2312 = (世界你好!)
submit = (Submit)
Again, the result matches the rules listed earlier in this chapter. Input strings are recorded as
UTF-8 byte sequences when entered on the page. Each byte that is not a URL-safe character is then
URL encoded as %xx (and each space as +) when the form is sent to the server. When the input strings
are parsed into $_REQUEST, they are decoded back to the same UTF-8 byte sequences.
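The round trip described above can be reproduced on the server side alone. The following is only a minimal sketch, not part of the original script: it assumes the Spanish sample value and uses PHP's urlencode() and parse_str() to stand in for what the browser and PHP do with a GET form.

```php
<?php
# Assumed sample value: the Spanish input written as raw UTF-8 bytes,
# so the result does not depend on how this file itself is saved.
$spanish = "\xC2\xA1Hola mundo!";

# Browser side: bytes other than letters, digits, "-", "_" and "."
# become %XX, and a space becomes "+".
$encoded = urlencode($spanish);
print("$encoded\n");            # %C2%A1Hola+mundo%21

# Server side: parse_str() decodes a query string into an array,
# much like PHP does when it fills $_GET.
parse_str("Spanish=$encoded", $fields);
print(bin2hex($fields['Spanish']) . "\n");
print(($fields['Spanish'] === $spanish ? "same bytes" : "different bytes") . "\n");
```

The encoded value is the same text that appears for the Spanish field in the dump file, and the decoded value has the same bytes as the original string.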
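One surprise to me is that the GB2312 characters are also recorded as UTF-8 byte sequences. Each
GB2312 byte shows up as a two-byte sequence in the query string (for example, byte 0xCA becomes
%C3%8A), which suggests that the browser treated each byte as a separate character and encoded
that character in UTF-8.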
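One possible explanation can be checked with a short sketch. It assumes that each GB2312 byte was taken as one ISO-8859-1 character before being encoded in UTF-8. The helper latin1_to_utf8() is a hypothetical function written for this illustration, and the GB2312 byte values are inferred from the dump, not taken from the original script.

```php
<?php
# Hypothetical helper: treat each input byte as an ISO-8859-1 character
# and emit its UTF-8 encoding (one byte for ASCII, two bytes otherwise).
function latin1_to_utf8($bytes) {
  $out = '';
  for ($i = 0; $i < strlen($bytes); $i++) {
    $b = ord($bytes[$i]);
    $out .= ($b < 0x80)
      ? chr($b)
      : chr(0xC0 | ($b >> 6)) . chr(0x80 | ($b & 0x3F));
  }
  return $out;
}

# Assumed GB2312 bytes of the ChineseGb2312 field, inferred from the dump:
# CA C0, BD E7, C4 E3, BA C3, A3 A1
$gb2312 = "\xCA\xC0\xBD\xE7\xC4\xE3\xBA\xC3\xA3\xA1";

$query = urlencode(latin1_to_utf8($gb2312));
print("$query\n");
# %C3%8A%C3%80%C2%BD%C3%A7%C3%84%C3%A3%C2%BA%C3%83%C2%A3%C2%A1
```

Under this assumption the output matches the ChineseGb2312 value in the dumped query string, which is consistent with the explanation but does not prove how the browser actually handled the input.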