Non-ASCII Characters in HTML Documents
This chapter explains:
- Basic Rules
- French Characters in HTML Documents - UTF-8 Encoding
- French Characters in HTML Documents - ISO-8859-1 Encoding
- Chinese Characters in HTML Documents - UTF-8 Encoding
- Chinese Characters in HTML Documents - GB2312 Encoding
- Characters of Multiple Languages in HTML Documents
Basic Rules
As you have seen in the previous chapters, a Web-based application always
delivers information to the user interface as an HTML document. The application
can either take a static HTML document from the file system, or generate
a dynamic HTML document from a PHP script.
Let's concentrate on how to handle non-ASCII characters in static HTML documents first.
Here are the steps and technologies involved in entering an HTML document and delivering
it to the user interface:
H1. Key sequences from keyboard
       |
       |- Text editor
       v
H2. HTML document
       |
       |- Web server
       v
H3. HTTP response
       |
       |- Internet TCP/IP connection
       v
H4. HTTP response
       |
       |- Web browser
       v
H5. Visual characters on the screen
Based on my experience, here are some basic rules related to those steps:
1. You must first decide on the character encoding scheme to be used in the HTML document.
For most languages, you have two options: (a) use an encoding scheme specific to that language, or
(b) use a Unicode encoding scheme. For example, you can use either GB2312 (a simplified Chinese encoding scheme)
or UTF-8 (a Unicode encoding scheme) for Chinese characters. My suggestion used to be "a". But today
I suggest "b", because a Unicode encoding scheme can support the characters of all languages.
2. PHP is convenient in this respect. Its string data type is defined as a sequence of bytes,
as in the C language. This is different from Java, where a string is defined as a sequence of
Unicode characters (UTF-16 code units). String literals in PHP can hold any sequence of bytes. Therefore you can enter
non-ASCII characters as PHP string literals in any encoding scheme.
3. From step "H1" to "H2", you need to select a good text editor that supports the encoding scheme you have selected.
The end goal of this step is simple: the characters in the HTML document must be stored in a file using the
selected encoding scheme. Don't underestimate the difficulty of this step. It can be very frustrating,
because most computer keyboards support alphabetic letters only. You may have to use language-specific
input software to translate alphabetic key sequences into language-specific characters. The editor may
also hold characters in memory in one encoding scheme and offer you a different encoding scheme when saving
files to the hard disk.
4. From step "H3" to "H4", the work is done between the Web server and the Web browser. The HTTP response is
transmitted as is to the browser. The characters in the HTML document carried in the HTTP response are
also kept as is.
5. From step "H4" to "H5", the browser opens the received HTML document and displays the encoded characters
as written characters of the specific language. To do this, the browser needs your help in two ways. First,
specify the name of the character encoding used in the HTML document as the "charset" value in a <meta> tag.
Second, make sure the browser can access a character font file designed for the specified encoding scheme.
If no character encoding name is specified in a <meta> tag, some browsers will try to detect the
encoding scheme from the HTML document content. If detection fails, browsers fall back to a default encoding
scheme. For example, Internet Explorer (IE) uses "Western European" as the default encoding scheme.
"Western European" seems to refer to the "ISO-8859-1" standard.
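Rules 1 and 2 above can be made concrete with a short PHP sketch. It writes the same characters as explicit byte sequences (hex escapes, so the example does not depend on how the script file itself is saved) and shows that strlen() counts bytes, not characters. The variable names are only illustrative; the byte values are the standard codes for "é" and "中" in each encoding.

```php
<?php
// "é" (U+00E9) in two encodings, written as raw bytes.
$e_latin1 = "\xE9";        // ISO-8859-1: 1 byte
$e_utf8   = "\xC3\xA9";    // UTF-8: 2 bytes

// "中" (U+4E2D) in two encodings.
$zhong_gb2312 = "\xD6\xD0";      // GB2312: 2 bytes
$zhong_utf8   = "\xE4\xB8\xAD";  // UTF-8: 3 bytes

// strlen() reports the number of bytes, because a PHP string is a byte sequence.
echo strlen($e_latin1), " ", strlen($e_utf8), "\n";          // prints: 1 2
echo strlen($zhong_gb2312), " ", strlen($zhong_utf8), "\n";  // prints: 2 3

// bin2hex() shows the stored bytes exactly as they would be written to a file.
echo bin2hex($zhong_gb2312), " ", bin2hex($zhong_utf8), "\n"; // prints: d6d0 e4b8ad
```

PHP itself never needs to know which encoding the bytes are in. That knowledge has to be carried separately, which is what the "charset" declaration in rule 5 is for.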
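For rule 3 above, one way to confirm that the editor really saved the file in the encoding you selected is to inspect the bytes. Here is a minimal sketch, assuming the selected encoding is UTF-8 and PHP 7 or later. The function name describe_encoding() is hypothetical. The check relies on the PCRE "u" modifier, which makes preg_match() return false when the subject is not valid UTF-8.

```php
<?php
// Returns a short description of what the given bytes look like.
// Assumption: we only want to tell UTF-8 apart from "something else".
function describe_encoding(string $bytes): string
{
    // Some editors put a UTF-8 byte order mark (EF BB BF) at the start of the file.
    if (substr($bytes, 0, 3) === "\xEF\xBB\xBF") {
        return "UTF-8 with BOM";
    }
    // Pure 7-bit content has the same bytes in ASCII, ISO-8859-1, GB2312 and UTF-8.
    if (preg_match('/^[\x00-\x7F]*$/', $bytes) === 1) {
        return "ASCII only";
    }
    // With the "u" modifier, preg_match() returns false for invalid UTF-8.
    if (preg_match('//u', $bytes) === 1) {
        return "valid UTF-8";
    }
    return "not UTF-8";
}

echo describe_encoding("Bonjour"), "\n";      // prints: ASCII only
echo describe_encoding("caf\xC3\xA9"), "\n";  // prints: valid UTF-8
echo describe_encoding("caf\xE9"), "\n";      // prints: not UTF-8
```

For a real document, you could pass the result of file_get_contents() on the saved HTML file to this function. A "not UTF-8" result usually means the editor saved the file in a language-specific encoding instead.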
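Rule 5 above can be sketched in PHP as well. The first part builds a small HTML document whose <meta> tag declares the encoding actually used for the bytes. The second part imitates a browser that falls back to ISO-8859-1 for UTF-8 content, where every byte is treated as one character. The function name build_page() is hypothetical, and the sketch assumes PHP 7 or later with the iconv extension available, as it is in most standard PHP builds.

```php
<?php
// Build an HTML document that declares its own encoding in a <meta> tag.
// $charset must name the encoding that $body_bytes is really stored in.
function build_page(string $charset, string $body_bytes): string
{
    return "<html><head>\n"
         . "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=$charset\"/>\n"
         . "</head><body>\n"
         . "<p>$body_bytes</p>\n"
         . "</body></html>\n";
}

$page = build_page("UTF-8", "caf\xC3\xA9");  // "café" stored as UTF-8 bytes
echo $page;

// What a browser shows if it wrongly assumes ISO-8859-1 for those bytes:
// each byte becomes one Latin-1 character, so 0xC3 0xA9 turns into "Ã©".
$misread = iconv("ISO-8859-1", "UTF-8", "caf\xC3\xA9");
echo bin2hex($misread), "\n";  // prints: 636166c383c2a9
```

The two extra characters in the misread result are the typical symptom of a missing or wrong "charset" declaration: the bytes arrived intact, but the browser decoded them with the wrong encoding.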