Receiving Non ASCII Characters from Input Forms
This chapter explains:
- Basic Rules
- Receiving Non ASCII Characters with GET Method
- Receiving Non ASCII Characters with POST Method
- Receiving Non ASCII Characters in UTF-8
- Decoding HTML Entities
Basic Rules
As you have seen in the previous chapters, when PHP scripts are used in a Web based application,
they run behind a Web server. PHP scripts are expected to generate HTML documents and
pass them back to the Web server. There are four main ways non ASCII characters can get into the HTML document
through PHP scripts: a) Entered as string literals; b) Received from the HTTP request; c) Retrieved from files;
d) Retrieved from a database.
In this chapter, we will concentrate on how to handle non ASCII characters received in the HTTP request.
Here are the steps involved in this scenario:
C1. Key sequences on keyboard
   |
   |- Language input tool (optional)
   v
C2. Byte sequences
   |
   |- Web browser
   v
C3. HTTP request
   |
   |- Internet TCP/IP Connection
   v
C4. HTTP request
   |
   |- Web server
   v
C5. CGI variables and input stream
   |
   |- PHP CGI interface
   v
C6. PHP built-in variable and input stream
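To make the last step (C5 to C6) more concrete, here is a minimal sketch that can be run from the command line. It simulates the query string a Web server would hand to PHP; the field names "name" and "city" and their values are hypothetical sample data, not something taken from a real form.

```php
<?php
// Simulated C5 value: what a Web server would pass as the query string.
// The field names and values below are hypothetical sample data.
$queryString = 'name=%E4%BD%A0&city=New+York';

// parse_str() applies URL decoding and fills an array, much like the
// array PHP exposes as $_GET for a real request.
parse_str($queryString, $fields);

foreach ($fields as $key => $value) {
    // bin2hex() shows the raw bytes without depending on the terminal's encoding.
    printf("%s: %d byte(s), hex=%s\n", $key, strlen($value), bin2hex($value));
}
```

In a real request, the decoded values would normally be read from $_GET or $_POST, and the raw POST body from the "php://input" stream.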
Based on my experience, here are some basic rules related to those steps:
1. Page encoding - Input strings entered in an HTML page are first encoded according to the page's
"charset" setting. For example, if the page has "charset=iso-8859-1", Unicode characters that cannot be
represented in that character set will be encoded as HTML entities in the form of "&#nnnnn;", where "nnnnn" represents the
decimal value of the Unicode character code. For example, "&#20320;" is a Unicode character (U+4F60) encoded
as an HTML entity.
If the page has "charset=utf-8", non ASCII characters will be encoded as UTF-8
byte sequences. For example, "\xE4\xBD\xA0" is the same Unicode character encoded as a UTF-8 byte sequence.
2. URL encoding - The Web browser will then apply "application/x-www-form-urlencoded" encoding to all input strings when sending
them to the server as part of the HTTP request. URL encoding
converts each non ASCII byte to the form "%xx", where "xx" is the HEX value of the byte. URL encoding
also converts special characters to the form "%xx", with one exception: the space character
" " is converted to "+".
For example, if the page has "charset=iso-8859-1" and a Unicode character outside that character set is entered into the page,
it will first be encoded as an HTML entity, like "&#20320;". When it is sent to the server,
it will be encoded again as "%26%2320320%3B".
If the page has "charset=utf-8" and the same Unicode character is entered into the page,
it will first be encoded as a UTF-8 byte sequence, like "\xE4\xBD\xA0". When it is sent to the server,
it will be encoded again as "%E4%BD%A0".
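The two rules can be checked with a short command-line sketch. It does not involve a real browser: the two input strings are written by hand to stand for the result of page encoding (rule 1), and PHP's urlencode() is used as a stand-in for the browser's URL encoding (rule 2). The variable names are chosen for illustration only.

```php
<?php
// Rule 1, case "charset=iso-8859-1": the character arrives as a numeric entity.
$asEntity = '&#20320;';
// Rule 1, case "charset=utf-8": the character arrives as three UTF-8 bytes.
$asUtf8 = "\xE4\xBD\xA0";

// Rule 2: urlencode() converts special and non ASCII bytes to "%xx"
// and the space character to "+".
$wireEntity = urlencode($asEntity);
$wireUtf8   = urlencode($asUtf8);
$wireSpace  = urlencode('a b');

echo $wireEntity, "\n";   // %26%2320320%3B
echo $wireUtf8, "\n";     // %E4%BD%A0
echo $wireSpace, "\n";    // a+b

// Reversing both steps on the server side: URL-decode first,
// then decode the HTML entity into UTF-8 bytes.
$decoded = html_entity_decode(urldecode($wireEntity), ENT_QUOTES | ENT_HTML401, 'UTF-8');
echo bin2hex($decoded), "\n";   // e4bda0
```

Both cases end up as the same three UTF-8 bytes once the entity is decoded, which is why a script that accepts input from pages with different "charset" settings needs to know which case it is dealing with.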