PHP Tutorials - Herong's Tutorial Notes
Dr. Herong Yang, Version 2.21

Non ASCII Characters in HTML documents


PHP Tutorials - Herong's Tutorial Notes © Dr. Herong Yang


This chapter explains:

  • Basic Rules
  • French Characters in HTML Documents - UTF-8 Encoding
  • French Characters in HTML Documents - ISO-8859-1 Encoding
  • Chinese Characters in HTML Documents - UTF-8 Encoding
  • Chinese Characters in HTML Documents - GB2312 Encoding
  • Characters of Multiple Languages in HTML Documents

Basic Rules

As you saw in the previous chapters, a Web based application always delivers information to the user interface as an HTML document. The application can either take a static HTML document from the file system, or generate a dynamic HTML document from a PHP script.

Let's concentrate on how to handle non ASCII characters in static HTML documents first. Here are the steps and technologies involved in entering an HTML document and delivering it to the user interface:

H1. Key Sequences from keyboard
      |
      |- Text editor
      v
H2. HTML Document
      |
      |- Web server
      v
H3. HTTP Response
      |
      |- Internet TCP/IP Connection
      v
H4. HTTP Response
      |
      |- Web browser
      v
H5. Visual characters on the Screen

Based on my experience, here are some basic rules related to those steps:

1. You must decide on the character encoding schema to be used in the HTML document first. For most languages, you have two options, a: use an encoding schema specific to that language; b: use a Unicode schema. For example, you can use either GB2312 (a simplified Chinese character schema) or UTF-8 (a Unicode character schema) for Chinese characters. My suggestion used to be "a". But today, I am suggesting "b", because a Unicode schema can support the characters of all languages.
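
To make the two options concrete, here is a minimal sketch that compares how the same Chinese character is stored under each schema. The bytes are written as hex escapes so the example does not depend on how this script file itself is saved; the variable names are only illustrative.

```php
<?php
// The Chinese character "中" (U+4E2D), written out byte by byte.
$gb2312 = "\xD6\xD0";     // option a: GB2312, 2 bytes
$utf8   = "\xE4\xB8\xAD"; // option b: UTF-8, 3 bytes

// bin2hex() shows the raw bytes stored in each string.
echo "GB2312: " . bin2hex($gb2312) . "\n"; // prints "GB2312: d6d0"
echo "UTF-8:  " . bin2hex($utf8) . "\n";   // prints "UTF-8:  e4b8ad"
```

The two byte sequences have nothing in common, which is why the schema must be decided first and then used consistently in every later step.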

2. PHP is convenient for this purpose. Its string data type is defined as a sequence of bytes, as in the C language. This differs from the Java language, where a string is defined as a sequence of Unicode characters. String literals in PHP can hold any sequence of bytes, and the bytes stored in the script file are used as they are. Therefore you can enter non ASCII characters as PHP string literals in any encoding schema.
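
The byte-oriented nature of PHP strings can be seen in a short sketch like the one below. It uses only the core functions strlen() and bin2hex(); the sample word and variable names are chosen for illustration.

```php
<?php
// The word "café" stored in two different encodings, using hex escapes.
$latin1 = "caf\xE9";     // ISO-8859-1: "é" is 1 byte
$utf8   = "caf\xC3\xA9"; // UTF-8: "é" is 2 bytes

// strlen() counts bytes, not characters.
echo strlen($latin1) . "\n"; // prints 4
echo strlen($utf8) . "\n";   // prints 5

// Indexing a string returns a single byte, here the first byte of "é".
echo bin2hex($utf8[3]) . "\n"; // prints c3
```

PHP itself attaches no encoding to either string. Both are valid strings, and it is up to the script and the browser to agree on what the bytes mean.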

3. From step "H1" to "H2", you need to select a good text editor that supports the encoding schema you have selected. The end goal of this step is simple - characters in the HTML document must be stored in a file using the selected encoding schema. Don't underestimate the difficulty of this step. It can be very frustrating, because most computer keyboards support alphabetic letters only. You may have to use some language specific input software to translate alphabetic letters into language specific characters. The editor may also store characters in memory in one encoding schema, and offer you a different encoding schema when saving files to the hard disk.
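
Because the editor may save a file in a different schema than you expect, it can help to check the stored bytes. The following is a rough sketch, assuming PHP 7 or later with the standard PCRE functions; looks_like_utf8() is a hypothetical helper name, and the temporary file only simulates what an editor would write.

```php
<?php
// Returns true if $bytes is a well-formed UTF-8 sequence.
// preg_match() with the /u modifier returns false on invalid UTF-8 input.
function looks_like_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Simulate two files saved by an editor under different schemas.
$path = tempnam(sys_get_temp_dir(), 'enc');

file_put_contents($path, "fran\xC3\xA7ais"); // "français" saved as UTF-8
$utf8_ok = looks_like_utf8(file_get_contents($path));

file_put_contents($path, "fran\xE7ais");     // "français" saved as ISO-8859-1
$latin1_ok = looks_like_utf8(file_get_contents($path));

unlink($path);

var_dump($utf8_ok);   // bool(true)
var_dump($latin1_ok); // bool(false)
```

A check like this only tells you whether the bytes form valid UTF-8. It cannot identify which language specific schema was used instead.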

4. The step from "H3" to "H4" is handled between the Web server and the Web browser. The HTTP response is transmitted as is to the browser. The characters in the HTML document attached to the HTTP response are also kept as is.

5. From step "H4" to "H5", the browser opens the received HTML document and displays the encoded characters as written characters of the specific language. To do this, the browser needs your help. The first help is to specify the character encoding name, "charset", used in the HTML document in a <meta> tag. The second help is to make sure the browser can access a character font file designed for the specified encoding schema.
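
As an illustration of the first help, the sketch below builds a small HTML document that declares its charset in a <meta> tag. It is written in PHP only to keep the examples in one language; in a static HTML document you would type the same <meta> line by hand. build_page() is a hypothetical helper, not a PHP built-in, and the sketch assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: builds a minimal HTML document that declares its charset.
function build_page(string $charset, string $body): string
{
    return "<html>\n<head>\n"
        . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=$charset\">\n"
        . "</head>\n<body>\n$body\n</body>\n</html>\n";
}

// The body bytes must really be in the declared encoding:
// here "é" is given as its UTF-8 bytes.
echo build_page('UTF-8', "<p>Caf\xC3\xA9</p>");
```

The declared name and the actual bytes are independent of each other. The tag only tells the browser how to read the bytes; it does not convert them.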

If no character encoding name is specified in a <meta> tag, some browsers will try to detect the encoding schema based on the HTML document content. If detection is not successful, browsers fall back to a default encoding schema. For example, Internet Explorer (IE) uses "Western European" as the default encoding schema. "Western European" seems to refer to the "ISO-8859-1" standard.
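
The effect of a wrong fallback can be simulated in a short sketch. It assumes a browser that reads every byte as one ISO-8859-1 character, which is a simplification of real browser behavior; read_as_latin1() is a hypothetical helper written for this illustration, and the sketch assumes PHP 7 or later.

```php
<?php
// Re-encodes bytes as UTF-8 under the assumption that each input byte is
// one ISO-8859-1 character. This mimics a browser that wrongly falls back
// to ISO-8859-1 when reading the document.
function read_as_latin1(string $bytes): string
{
    $out = '';
    for ($i = 0; $i < strlen($bytes); $i++) {
        $b = ord($bytes[$i]);
        if ($b < 0x80) {
            $out .= chr($b);              // ASCII stays unchanged
        } else {
            $out .= chr(0xC0 | ($b >> 6)) // 2-byte UTF-8 sequence
                  . chr(0x80 | ($b & 0x3F));
        }
    }
    return $out;
}

$utf8_e_acute = "\xC3\xA9";             // "é" stored as UTF-8
$shown = read_as_latin1($utf8_e_acute); // what the reader would see
echo bin2hex($shown) . "\n";            // prints c383c2a9, the UTF-8 bytes of "Ã©"
```

One UTF-8 character turns into two unrelated Western European characters, which is the typical symptom of a missing or wrong charset declaration.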

(Continued on next part...)


Dr. Herong Yang, updated in 2006