Basic Rules of Receiving Non-ASCII Characters from Input Forms

This section describes basic rules on how non-ASCII character strings should be managed at different steps to ensure localized text strings can be entered in HTML forms and received correctly by PHP scripts that process those forms.

As you saw in the previous chapters, when PHP scripts are used in a Web-based application, they always run behind a Web server. PHP scripts are expected to generate HTML documents and pass them back to the Web server. There are four main ways non-ASCII characters can get into the HTML document through PHP scripts: a) Enter them as string literals; b) Receive them from the HTTP request; c) Retrieve them from files; d) Retrieve them from a database.

In this chapter, we will concentrate on how to handle non-ASCII characters received in the HTTP request. Here are the steps involved in this scenario:

C1. Key sequences on keyboard
      |
      |- Language input tool (optional)
      v
C2. Byte sequences
      |
      |- Web browser
      v
C3. HTTP request
      |
      |- Internet TCP/IP Connection
      v
C4. HTTP request
      |
      |- Web server
      v
C5. CGI variables and input stream
      |
      |- PHP CGI interface
      v
C6. PHP built-in variable and input stream

Based on my experience, here are some basic rules related to those steps:

1. Page encoding - Input strings entered in an HTML page will be encoded immediately based on the page's "charset" setting. For example, if the page has "charset=iso-8859-1", Unicode characters that this charset cannot represent will be encoded as HTML entities in the form of "&#nnnnn;", where "nnnnn" represents the decimal value of the Unicode character code. For example, "&#20320;" is the Unicode character "你" encoded as an HTML entity.

If the page has "charset=utf-8", Unicode characters will be encoded as UTF-8 byte sequences. For example, "\xE4\xBD\xA0" is the same Unicode character "你" encoded as a UTF-8 byte sequence.

2. URL encoding - The Web browser will then apply "application/x-www-form-urlencoded" encoding, also called URL encoding, to all input strings when sending them to the server as part of the HTTP request. URL encoding converts each non-ASCII byte to the form "%xx", where "xx" is the hexadecimal value of the byte. URL encoding also converts special characters to the form "%xx", with one exception: the space character " " is converted to "+".

For example, suppose the page has "charset=iso-8859-1" and the Unicode character "你" is entered into the page. It will be encoded immediately as an HTML entity, "&#20320;". When sending it to the server, it will be encoded again as "%26%2320320%3B".

If the page has "charset=utf-8" and the same Unicode character is entered into the page, it will be encoded immediately as a UTF-8 byte sequence, "\xE4\xBD\xA0". When sending it to the server, it will be encoded again as "%E4%BD%A0".

3. From step "C3" to "C4", the Internet connection will pass the URL-encoded input strings through unchanged.

4. From step "C4" to "C5", the Web server will pass the URL-encoded input strings through unchanged.

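5. From step "C5" to "C6", the PHP CGI interface does something helpful for you: it applies URL decoding to the input strings before storing them in built-in variables like $_GET and $_POST.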

6. What you do with the characters in the input data is your decision. You could output them back to the HTML document, or store them in a file. Of course, you can also apply any conversion you want.
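Rules 1 and 2 can be sketched in a few lines of PHP. This is only an illustration, not what the browser actually runs: the two starting strings are written by hand to stand for what the browser would hold after page encoding, and PHP's urlencode() function is used as a stand-in for the browser's URL encoding step. The variable names are made up for this sketch.

```php
<?php
// The same character, U+4F60, as it would look after page encoding (rule 1).
$asEntity = '&#20320;';       // page with charset=iso-8859-1: HTML entity
$asUtf8   = "\xE4\xBD\xA0";   // page with charset=utf-8: UTF-8 byte sequence

// urlencode() follows the form encoding convention described in rule 2:
// "%xx" for special characters and non-ASCII bytes, "+" for the space.
$encodedEntity = urlencode($asEntity);
$encodedUtf8   = urlencode($asUtf8);
$encodedSpace  = urlencode('a b');

echo $encodedEntity, "\n";    // %26%2320320%3B
echo $encodedUtf8, "\n";      // %E4%BD%A0
echo $encodedSpace, "\n";     // a+b
```

Note that the entity form is pure ASCII before URL encoding, so only its special characters "&", "#" and ";" are converted.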

In the sections below, I will show you some sample PHP scripts to validate those rules.
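As a quick preview, the receiving side can be simulated without a Web server. The sketch below assumes a hypothetical form field named "q" and uses parse_str(), which parses a string as if it were a query string passed in a URL, in place of a real request. The raw strings are the two examples from rule 2.

```php
<?php
// Raw query strings as they might arrive at step C5 (field name "q" is made up).
$rawFromLatin1Page = 'q=%26%2320320%3B';
$rawFromUtf8Page   = 'q=%E4%BD%A0';

// parse_str() URL-decodes the values, similar to how $_GET is filled.
parse_str($rawFromLatin1Page, $latin1Fields);
parse_str($rawFromUtf8Page, $utf8Fields);

// After URL decoding, the first value is still an HTML entity.
$entity = $latin1Fields['q'];

// One more explicit step turns the entity into UTF-8 bytes.
$decoded = html_entity_decode($entity, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// The second value is already the UTF-8 byte sequence.
$utf8 = $utf8Fields['q'];

echo $entity, "\n";            // &#20320;
echo bin2hex($decoded), "\n";  // e4bda0
echo bin2hex($utf8), "\n";     // e4bda0

// If the value is echoed back into a UTF-8 HTML page (rule 6),
// escaping it first is a reasonable precaution.
$safe = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');
```

Both paths end with the same three bytes, but only the UTF-8 page gets there without an extra decoding step.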
