PHP Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 2.21

Receiving Non ASCII Characters from Input Forms

This chapter explains:

  • Basic Rules
  • Receiving Non ASCII Characters with GET Method
  • Receiving Non ASCII Characters with POST Method
  • Receiving Non ASCII Characters in UTF-8
  • Decoding HTML Entities

Basic Rules

As you saw in the previous chapters, PHP scripts in a Web based application always run behind a Web server. They are expected to generate HTML documents and pass them back to the Web server. There are four common ways non ASCII characters can get into the HTML document through PHP scripts: a) Enter them as string literals; b) Receive them from the HTTP request; c) Retrieve them from files; d) Retrieve them from a database.

In this chapter, we will concentrate on how to handle non ASCII characters received in the HTTP request. Here are the steps involved in this scenario:

C1. Key sequences on keyboard
      |
      |- Language input tool (optional)
      v
C2. Byte sequences
      |
      |- Web browser
      v
C3. HTTP request
      |
      |- Internet TCP/IP Connection
      v
C4. HTTP request
      |
      |- Web server
      v
C5. CGI variables and input stream 
      |
      |- PHP CGI interface
      v
C6. PHP built-in variable and input stream

Based on my experience, here are some basic rules related to those steps:

1. Page encoding - Input strings entered in an HTML page are encoded by the browser based on the page's "charset" setting. For example, if the page has "charset=iso-8859-1", Unicode characters that cannot be represented in that charset, such as Chinese characters, will be encoded as HTML entities in the form of "&#nnnnn;", where "nnnnn" represents the decimal value of the Unicode character code. For example, "&#20320;" is a Unicode character encoded as an HTML entity.

If the page has "charset=utf-8", Unicode characters will be encoded as UTF-8 byte sequences. For example, "\xE4\xBD\xA0" is the Unicode character U+4F60 (decimal 20320) encoded as a 3-byte UTF-8 sequence.
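To make the two page encodings concrete, here is a minimal sketch showing that the entity form and the UTF-8 form describe the same character. The sample value is hypothetical and stands for what a browser might submit; html_entity_decode() and bin2hex() are standard PHP functions.

```php
<?php
// Hypothetical sample: the HTML entity form of Unicode character 20320,
// as it might arrive from a page declared with charset=iso-8859-1.
$entity = '&#20320;';

// Decode the numeric entity into a UTF-8 byte sequence.
$utf8 = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

// Print the raw bytes in hex so the output does not depend on the
// encoding of the terminal or browser displaying it.
echo bin2hex($utf8), "\n";   // e4bda0
echo strlen($utf8), "\n";    // 3
```

The hex output matches the "\xE4\xBD\xA0" byte sequence that a page with "charset=utf-8" would produce directly.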

2. URL encoding - The Web browser then applies "application/x-www-form-urlencoded" encoding to all input strings when sending them to the server as part of the HTTP request. URL encoding converts each non ASCII byte to the form "%xx", where "xx" is the HEX value of the byte. URL encoding also converts special characters to the form "%xx", with one exception: the space character " " is converted to "+".

For example, suppose the page has "charset=iso-8859-1" and a Unicode character is entered into the page. It will first be encoded as an HTML entity, like "&#20320;". When it is sent to the server, it will be encoded again as "%26%2320320%3B".

If the page has "charset=utf-8" and the same Unicode character is entered into the page, it will first be encoded as a UTF-8 byte sequence, like "\xE4\xBD\xA0". When it is sent to the server, it will be encoded again as "%E4%BD%A0".
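The URL encoding step can be reproduced in PHP itself. The following is a rough sketch, not a capture of real browser traffic: the input values and the field name "name" are hypothetical, urlencode() imitates what the browser does, and parse_str() is used here only as a stand-in for the decoding PHP performs on an incoming query string.

```php
<?php
// Hypothetical form values for the two page charsets discussed above.
$asEntity = '&#20320;';        // from a page with charset=iso-8859-1
$asUtf8   = "\xE4\xBD\xA0";    // from a page with charset=utf-8

// urlencode() applies the form encoding rules: "%xx" for special
// characters and non ASCII bytes, "+" for the space character.
$encodedEntity = urlencode($asEntity);
$encodedUtf8   = urlencode($asUtf8);
$encodedSpace  = urlencode('a b');

echo $encodedEntity, "\n";   // %26%2320320%3B
echo $encodedUtf8, "\n";     // %E4%BD%A0
echo $encodedSpace, "\n";    // a+b

// Assumption: parse_str() stands in for the server side. It splits a
// query string and URL-decodes each value, so the original bytes return.
parse_str('name=' . $encodedUtf8, $vars);
echo bin2hex($vars['name']), "\n";   // e4bda0
```

Note that URL decoding only undoes the second encoding step. A value submitted as an HTML entity is still the string "&#20320;" after decoding, and needs separate entity decoding.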

(Continued on next part...)

Dr. Herong Yang, updated in 2006