PHP Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 2.21

Non ASCII Characters as String Literals

This chapter explains:

  • Basic Rules
  • French Characters in String Literals - UTF-8 Encoding
  • French Characters in String Literals - ISO-8859-1 Encoding
  • Chinese Characters in String Literals - UTF-8 Encoding
  • Chinese Characters in String Literals - GB2312 Encoding
  • Characters of Multiple Languages in String Literals

Basic Rules

As you have seen in the previous chapters, when PHP scripts are part of a Web based application, they always run behind a Web server. PHP scripts are expected to generate HTML documents and pass them back to the Web server. There are four main ways non-ASCII characters can get into the HTML document through PHP scripts: a) Enter them as string literals; b) Receive them from HTTP requests; c) Retrieve them from files; d) Retrieve them from a database.

In this chapter, we will concentrate on how to include non-ASCII characters in PHP scripts as string literals. Here are the steps involved in this scenario:

A1. Key Sequences from keyboard
      |
      |- Text editor
      v
A2. PHP File
      |
      |- PHP CGI engine
      v
A3. HTML Document

Based on my experience, here are some basic rules related to those steps:

1. You must decide on the character encoding schema to be used in your PHP script file. For most languages, you have two options: a) use an encoding schema specific to that language; b) use a Unicode schema. For example, you can use either GB2312 (a simplified Chinese character schema) or UTF-8 (a Unicode character schema) for Chinese characters. My suggestion used to be "a". But today, I suggest "b", because a Unicode schema can represent the characters of practically all languages in one file.

2. From step "A1" to "A2", you need to select a good text editor that supports the encoding schema you have decided on. The end goal of this step is simple - characters in string literals must be stored in the PHP file using the decided encoding schema. Don't underestimate the difficulty of this step. It can be very frustrating, because most computer keyboards support alphabetic letters only. You may have to use language specific input software to translate alphabetic letters into language specific characters. An editor may also hold characters in memory in one encoding schema and offer you a different encoding schema when saving files to the hard disk.

3. The string data type is defined as a sequence of bytes in PHP, as in the C language. This is different from the Java language, where the string data type is defined as a sequence of Unicode characters. String literals in PHP are also taken as sequences of bytes. This is a nice feature. It allows us to enter non-ASCII characters in almost any encoding schema.

4. The standard PHP built-in string functions assume that strings are sequences of bytes. For example, strlen() returns the number of bytes in the given string, not the number of characters of a specific language. To manage strings as sequences of characters, we need to use the Multibyte String functions, mb_*().

5. From step "A2" to "A3", HTML documents are generated from PHP scripts mainly through the print() function. The print() function copies every byte of the specified string to the HTML document unchanged. This guarantees that non-ASCII characters encoded in any encoding schema will be copied correctly to the HTML document. Again, this is different from JSP pages, where strings are converted into a byte stream based on a specified encoding schema if you are using character based output stream functions.

6. If you do want to convert from one encoding schema to another during the print() function call, you can use mb_output_handler as the callback function on the output buffer: ob_start("mb_output_handler").
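
Rules 1, 3, 4, 5 and 6 can be checked with a short script. What follows is only a sketch under some assumptions: the mbstring extension is enabled, the variable names are my own, and to_latin1() is a hypothetical helper written for this example, not a PHP built-in. It stands in for mb_output_handler so that the result does not depend on the mb_http_output() setting. All literals use "\x" byte escapes, so the script behaves the same no matter how the file itself is saved.

```php
<?php
// Sketch only: requires the mbstring extension.

// Rule 1: the same character has different bytes in different schemas.
$zhongUtf8   = "\xE4\xB8\xAD";   // U+4E2D in UTF-8 (3 bytes)
// EUC-CN is the mbstring name for the byte encoding of GB2312.
$zhongGb2312 = mb_convert_encoding($zhongUtf8, 'EUC-CN', 'UTF-8');

// Rules 3 and 4: a string is a sequence of bytes. strlen() counts bytes,
// mb_strlen() counts characters in the encoding schema you name.
$cafe      = "caf\xC3\xA9";      // "cafe" with e-acute, in UTF-8
$byteCount = strlen($cafe);
$charCount = mb_strlen($cafe, 'UTF-8');

// Rule 5: print copies bytes unchanged. Capture the output to check.
ob_start();
print $cafe;
$printed = ob_get_clean();

// Rule 6: convert on the way out with an output buffer callback.
// to_latin1() is a hypothetical helper, not a PHP built-in.
function to_latin1(string $buffer): string
{
    return mb_convert_encoding($buffer, 'ISO-8859-1', 'UTF-8');
}

ob_start();              // outer buffer, only here to capture the result
ob_start('to_latin1');   // inner buffer runs the conversion when flushed
print $cafe;
ob_end_flush();          // inner buffer -> callback -> outer buffer
$converted = ob_get_clean();

echo bin2hex($zhongUtf8), ' ', bin2hex($zhongGb2312), "\n";
echo $byteCount, ' ', $charCount, "\n";
echo bin2hex($printed), ' ', bin2hex($converted), "\n";
```

With mbstring available, I would expect the three output lines to be "e4b8ad d6d0", "5 4" and "636166c3a9 636166e9": the UTF-8 and GB2312 bytes differ, strlen() reports 5 bytes for 4 characters, print passes the UTF-8 bytes through untouched, and the callback rewrites the last character as the single ISO-8859-1 byte E9.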

(Continued on next part...)

Dr. Herong Yang, updated in 2006