PHP Tutorials - Herong's Tutorial Examples - v5.13, by Dr. Herong Yang
Basic Rules of Receiving Non-ASCII Characters from Input Forms
This section describes basic rules on how non-ASCII character strings should be managed at different steps to ensure localized text strings can be entered in HTML forms and received correctly by PHP scripts that process those forms.
As you have seen in the previous chapters, when PHP scripts are involved in a Web-based application, they are always used behind a Web server. PHP scripts are expected to generate HTML documents and pass them back to the Web server. There are four main ways non-ASCII characters can get into the HTML document through PHP scripts: a) Enter them as string literals; b) Receive them from the HTTP request; c) Retrieve them from files; d) Retrieve them from a database.
In this chapter, we will concentrate on how to handle non-ASCII characters received in the HTTP request. Here are the steps involved in this scenario:
C1. Key sequences on keyboard
    |
    |- Language input tool (optional)
    v
C2. Byte sequences
    |
    |- Web browser
    v
C3. HTTP request
    |
    |- Internet TCP/IP Connection
    v
C4. HTTP request
    |
    |- Web server
    v
C5. CGI variables and input stream
    |
    |- PHP CGI interface
    v
C6. PHP built-in variable and input stream
Based on my experience, here are some basic rules related to those steps:
1. Page encoding - Input strings entered in an HTML page will be encoded by the browser based on the page's "charset" setting. For example, if the page has "charset=iso-8859-1", Unicode characters that cannot be represented in that charset will be encoded as HTML entities in the form of "&#nnnnn;", where "nnnnn" is the decimal value of the Unicode code point. For example, "&#20320;" is the Unicode character U+4F60 encoded as an HTML entity.
If the page has "charset=utf-8", non-ASCII Unicode characters will be encoded as UTF-8 byte sequences. For example, "\xE4\xBD\xA0" is the same character, U+4F60, encoded as a 3-byte UTF-8 sequence.
2. URL encoding - The Web browser will then apply "application/x-www-form-urlencoded" encoding to all input strings when sending them to the server as part of the HTTP request. URL encoding converts each non-ASCII byte to the form "%xx", where "xx" is the hexadecimal value of the byte. It also converts special characters to the "%xx" form, with one exception: the space character " " is converted to "+".
For example, suppose the page has "charset=iso-8859-1" and a Unicode character is entered into the page. It will first be encoded as an HTML entity, like "&#20320;". When it is sent to the server, it will be encoded again as "%26%2320320%3B".
Now suppose the page has "charset=utf-8" and the same Unicode character is entered into the page. It will first be encoded as a UTF-8 byte sequence, like "\xE4\xBD\xA0". When it is sent to the server, it will be encoded again as "%E4%BD%A0".
3. From step "C3" to "C4", the Internet connection will pass the URL encoded input strings through unchanged.
4. From step "C4" to "C5", the Web server will pass the URL encoded input strings through unchanged.
5. From step "C5" to "C6", PHP CGI interface is doing something interesting for you:
6. What you want to do with the characters in the input data is your decision. You could output them back to the HTML document, or store them in a file. Of course, you can apply any conversion you want.
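To make the rules above concrete, here is a minimal sketch that replays the round trip for the sample character U+4F60 (decimal 20320) entirely on the PHP side. It rests on some assumptions: the two starting strings are hard-coded to what the rules say a browser would submit ("&#20320;" from an iso-8859-1 page, "\xE4\xBD\xA0" from a utf-8 page); urlencode() stands in for the browser's form encoding; and parse_str() stands in for the URL decoding PHP applies when it fills $_GET. The field name "text" and all variable names are made up for illustration.

```php
<?php
// Hypothetical replay of the rules for one character, U+4F60 (decimal 20320).
// Nothing here talks to a real browser or Web server.

// Rule 1: what each kind of page is assumed to submit for this character.
$asEntity = "&#20320;";      // from a page with charset=iso-8859-1
$asUtf8   = "\xE4\xBD\xA0";  // from a page with charset=utf-8

// Rule 2: URL encoding of the field value. urlencode() uses the form
// encoding convention, so a space would become "+".
$queryFromLatin1Page = "text=" . urlencode($asEntity);
$queryFromUtf8Page   = "text=" . urlencode($asUtf8);

echo $queryFromLatin1Page, "\n";  // text=%26%2320320%3B
echo $queryFromUtf8Page, "\n";    // text=%E4%BD%A0

// Rules 3 and 4: the query string travels unchanged to the PHP script.

// Rule 5 (stand-in): parse_str() parses a string as if it were the query
// string of a URL, which mirrors how $_GET is populated.
parse_str($queryFromLatin1Page, $fieldsLatin1);
parse_str($queryFromUtf8Page, $fieldsUtf8);

echo $fieldsLatin1['text'], "\n";         // &#20320;
echo bin2hex($fieldsUtf8['text']), "\n";  // e4bda0

// Rule 6: one possible conversion, turning the HTML entity into UTF-8 bytes
// so that both inputs end up in the same form.
$normalized = html_entity_decode($fieldsLatin1['text'], ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo bin2hex($normalized), "\n";          // e4bda0
```

Note that the script receives two different strings for the same character depending on the page's charset. Whether to normalize them, as the last step does, is the kind of decision rule 6 leaves to you.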
In the sections below, I will show you some sample PHP scripts to validate those rules.