Basic Rules of Using Non-ASCII Characters in HTML Documents

PHP Tutorials - Herong's Tutorial Examples

This section describes basic rules on how non-ASCII character strings should be managed at each step, so that localized text strings can be used in HTML documents and displayed correctly in the browser window.

As you can see from the previous chapters, a Web-based application always delivers information to the user interface as an HTML document. The application can either take a static HTML document from the file system, or generate a dynamic HTML document from a PHP script.

First, let's concentrate on how to handle non-ASCII characters in static HTML documents. Here are the steps and technologies involved in entering an HTML document and delivering it to the user interface:

H1. Key Sequences from keyboard
      |
      |- Text editor
      v
H2. HTML Document
      |
      |- Web server
      v
H3. HTTP Response
      |
      |- Internet TCP/IP Connection
      v
H4. HTTP Response
      |
      |- Web browser
      v
H5. Visual characters on the Screen

Based on my experience, here are some basic rules related to those steps:

1. You must first decide on the character encoding schema to be used in the HTML document. For most written human languages, you have two options: a) use an encoding schema specific to that language; b) use a Unicode schema. For example, you can use either GB2312 (a Simplified Chinese character schema) or UTF-8 (a Unicode character schema) for Chinese characters. I used to suggest option "a". But I now suggest option "b", because a Unicode schema can support the characters of all languages in one document.

2. PHP is convenient in this respect. A PHP string is defined as a sequence of bytes, as in the C language. This is different from the Java language, where a string is defined as a sequence of Unicode characters. String literals in PHP can hold any sequence of bytes. Therefore you can enter non-ASCII characters as PHP string literals in any encoding schema. The flip side is that PHP does not record which encoding schema a string uses, so keeping track of it is your job.

3. From step "H1" to "H2", you need to select a good text editor that supports the encoding schema you have selected. The end goal of this step is simple: characters in the HTML document must be stored in a file using the selected encoding schema. Don't underestimate the difficulty of this step. It can be very frustrating, because most computer keyboards support alphabetic letters only. You may have to use language-specific input software to translate alphabetic key sequences into language-specific characters. The editor may also hold characters in memory in one encoding schema, and offer you a different encoding schema when saving files to the hard disk.

4. From step "H3" to "H4", it is the job of the Internet to send data from the Web server to the Web browser. The HTTP response is transmitted as is to the browser. The bytes of the HTML document carried in the HTTP response are also preserved as is.

5. From step "H4" to "H5", the browser opens the received HTML document and displays the encoded characters as written characters of the specific language. To do this, the browser needs your help in two ways. First, specify the character encoding name, "charset", used in the HTML document in a <meta> tag. Second, make sure the browser can access a character font file that covers the characters of the specified encoding schema.
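Rules 1, 2 and 5 can be tied together in a short PHP sketch. It is only an illustration, not a definitive implementation. The helper name build_page() and its parameters are my own invention, the mbstring extension is assumed to be installed, and 'EUC-CN' is assumed to be the name mbstring uses for the usual byte encoding of GB2312. The point is that the bytes in the document and the "charset" name declared in the <meta> tag must agree.

```php
<?php
// Hypothetical helper: returns a minimal HTML page whose bytes are encoded
// in $mbName and whose <meta> tag declares the matching $charset name.
function build_page(string $utf8Text, string $charset, string $mbName): string
{
    // Build the page in UTF-8 first; the markup itself is plain ASCII.
    $page = "<html><head>\n"
          . "<meta http-equiv=\"Content-Type\""
          . " content=\"text/html; charset=$charset\"/>\n"
          . "</head><body>\n"
          . "<p>" . htmlspecialchars($utf8Text, ENT_QUOTES, 'UTF-8') . "</p>\n"
          . "</body></html>\n";

    // Convert the whole page to the selected encoding schema (rule 1).
    return mb_convert_encoding($page, $mbName, 'UTF-8');
}

// Two Chinese characters written as UTF-8 byte escapes, so this example
// does not depend on how the script file itself was saved.
$text = "\xE4\xB8\xAD\xE6\x96\x87";

// Rule 2: a PHP string is a sequence of bytes.
echo strlen($text), "\n";              // 6 (bytes)
echo mb_strlen($text, 'UTF-8'), "\n";  // 2 (characters)

// Rule 1: the same characters are different bytes in a different schema.
echo bin2hex($text), "\n";                                          // e4b8ade69687
echo bin2hex(mb_convert_encoding($text, 'EUC-CN', 'UTF-8')), "\n";  // d6d0cec4

// Rule 5: each page declares the schema its bytes are actually stored in.
$utf8Page = build_page($text, 'UTF-8', 'UTF-8');
$gbPage   = build_page($text, 'GB2312', 'EUC-CN');
```

Either page can be saved to a file and served as a static HTML document. What matters is that the charset name in the <meta> tag is never changed without also converting the bytes.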

If no character encoding name is specified in the <meta> tag, some browsers will try to detect the encoding schema based on the HTML document content. If detection fails, browsers fall back to a default encoding schema. For example, Internet Explorer (IE) uses "Western European" as the default encoding schema. "Western European" seems to refer to the "ISO-8859-1" standard.
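To see what goes wrong when the browser falls back to the wrong default, here is a small sketch that simulates the misreading in PHP. It assumes the mbstring extension is installed, and it only imitates what a browser would do: UTF-8 bytes are deliberately decoded as if they were ISO-8859-1, so each byte turns into one separate character.

```php
<?php
// Two Chinese characters stored as UTF-8 bytes (3 bytes each).
$utf8 = "\xE4\xB8\xAD\xE6\x96\x87";

// Simulate a reader that assumes ISO-8859-1: every single byte is taken
// as one character. The result is re-encoded as UTF-8 so it can be measured.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo mb_strlen($utf8, 'UTF-8'), "\n";     // 2 characters, as intended
echo mb_strlen($misread, 'UTF-8'), "\n";  // 6 characters of garbage
echo bin2hex($misread), "\n";             // c3a4c2b8c2adc3a6c296c287
```

The 2 intended characters show up as 6 unrelated Western European characters and control codes, which is the typical symptom of a missing or wrong "charset" name.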