Non-ASCII Characters as String Literals
This chapter explains:
- Basic Rules
- French Characters in String Literals - UTF-8 Encoding
- French Characters in String Literals - ISO-8859-1 Encoding
- Chinese Characters in String Literals - UTF-8 Encoding
- Chinese Characters in String Literals - GB2312 Encoding
- Characters of Multiple Languages in String Literals
Basic Rules
As you have seen in the previous chapters, PHP scripts in a Web-based application
always run behind a Web server. They are expected to generate HTML documents and
pass them back to the Web server. There are four main ways non-ASCII characters can get into the HTML document
through PHP scripts: a) enter them as string literals; b) receive them from HTTP requests; c) retrieve them from files;
d) retrieve them from a database.
In this chapter, we will concentrate on how to include non-ASCII characters in PHP scripts as string literals.
Here are the steps involved in this scenario:
A1. Key Sequences from keyboard
|
|- Text editor
v
A2. PHP File
|
|- PHP CGI engine
v
A3. HTML Document
Based on my experience, here are some basic rules related to those steps:
1. You must decide on the character encoding scheme to be used in your PHP script file.
For most languages, you have two options: a) use an encoding scheme specific to that language;
b) use a Unicode encoding scheme. For example, you can use either GB2312 (a simplified Chinese encoding scheme)
or UTF-8 (a Unicode encoding scheme) for Chinese characters. My suggestion used to be "a". But today,
I am suggesting "b", because a Unicode encoding scheme can represent the characters of virtually all languages in the same file.
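To make the two options in rule 1 concrete, here is a small sketch that holds the same two Chinese characters once in UTF-8 and once in GB2312. The sample word and the variable names are my own choices for illustration. The bytes are written as hex escapes so that the example does not depend on how the script file itself is saved.

```php
<?php
// The same two Chinese characters (U+4E2D U+6587, "Chinese language"),
// written as explicit byte escapes so this file stays pure ASCII.
$utf8   = "\xE4\xB8\xAD\xE6\x96\x87"; // UTF-8: 3 bytes per character here
$gb2312 = "\xD6\xD0\xCE\xC4";         // GB2312: 2 bytes per character

// bin2hex() shows the raw bytes PHP holds for each string.
echo "UTF-8  bytes: ", bin2hex($utf8), " (", strlen($utf8), " bytes)\n";
echo "GB2312 bytes: ", bin2hex($gb2312), " (", strlen($gb2312), " bytes)\n";
```

The two strings represent the same text, but they share no bytes. This is why the encoding scheme has to be decided once and then used consistently for the whole script file.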
2. From step "A1" to "A2", you need to select a good text editor that supports the encoding scheme you have
decided on. The end goal of this step is simple: characters in string literals must be stored in the PHP
file using the chosen encoding scheme.
Don't underestimate the difficulty level of this step. It could be very frustrating,
because most computer keyboards have keys for Latin letters only. You may have to use some language-specific
input software to translate key sequences into language-specific characters. The editor may
also hold characters in memory in one encoding scheme, and offer you a different encoding scheme when saving
files to the hard disk.
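One way to check what the editor really stored in step "A2" is to look at the bytes of a string literal. The sketch below assumes the target encoding is UTF-8. The helper name looks_like_utf8() and the sample word are hypothetical, made up for this illustration. It relies on the fact that preg_match() with the /u modifier fails on input that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper: true if $bytes is well-formed UTF-8.
// preg_match() with the /u modifier returns false on invalid UTF-8 input.
function looks_like_utf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// The French word "cafe" with an acute accent on the "e", as an editor
// would store it under each encoding. Byte escapes are used so the
// example does not depend on how this file itself was saved.
$savedAsUtf8   = "caf\xC3\xA9"; // accented e is 2 bytes in UTF-8
$savedAsLatin1 = "caf\xE9";     // accented e is 1 byte in ISO-8859-1

$samples = ['UTF-8 save' => $savedAsUtf8, 'ISO-8859-1 save' => $savedAsLatin1];
foreach ($samples as $label => $s) {
    $verdict = looks_like_utf8($s) ? 'valid UTF-8' : 'not UTF-8';
    printf("%-16s %-12s %s\n", $label, bin2hex($s), $verdict);
}
```

In a real script, you would pass your own string literal to bin2hex() instead of the escaped samples, and compare the result with the bytes you expect from the encoding scheme you selected in the editor.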
3. The string data type in PHP is defined as a sequence of bytes, as in the C language. This is different from
the Java language, where the string data type is defined as a sequence of Unicode characters. String literals in
PHP are also taken as sequences of bytes. This is a nice feature: it allows us to enter non-ASCII characters
in almost any encoding scheme.
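Rule 3 can be observed directly with a short sketch. The variable names are mine, and the bytes are given as hex escapes for illustration. Indexing and concatenation both work on bytes, and PHP does not care which encoding scheme those bytes belong to.

```php
<?php
// One Chinese character (U+4E2D) in UTF-8 occupies three bytes.
$s = "\xE4\xB8\xAD";

// Indexing addresses a single byte, not a whole character.
$firstByte = $s[0];
echo "first byte: ", bin2hex($firstByte), "\n";

// PHP accepts any byte sequence in a literal. Here one ISO-8859-1 byte
// is appended to UTF-8 bytes; PHP only concatenates the bytes.
$latin1   = "\xE9";
$combined = $s . $latin1;
echo "combined:   ", bin2hex($combined), " (", strlen($combined), " bytes)\n";
```

The combined string is not valid text in either encoding, but PHP accepts it without complaint. Keeping the encoding consistent is the job of the programmer, not of the language.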
4. The standard PHP built-in string functions assume that strings are sequences of bytes. For example, strlen()
returns the number of bytes in the given string, not the number of characters of a specific language.
To manage strings as sequences of characters, we need to use the Multibyte String functions, mb_*().
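Here is a minimal sketch of the difference described in rule 4. It assumes that the mbstring extension is enabled and that the string is UTF-8 encoded. The sample string and variable names are my own.

```php
<?php
// Requires the mbstring extension.
// Two Chinese characters (U+4E2D U+6587) encoded in UTF-8.
$s = "\xE4\xB8\xAD\xE6\x96\x87";

$byteCount = strlen($s);             // counts bytes
$charCount = mb_strlen($s, 'UTF-8'); // counts characters

$firstByteOnly = substr($s, 0, 1);             // cuts inside a character
$firstChar     = mb_substr($s, 0, 1, 'UTF-8'); // one whole character

echo "strlen():    $byteCount\n";
echo "mb_strlen(): $charCount\n";
echo "substr():    ", bin2hex($firstByteOnly), "\n";
echo "mb_substr(): ", bin2hex($firstChar), "\n";
```

Note that the encoding name is passed explicitly to the mb_*() functions here. Without it, they fall back to the configured internal encoding, which may not match the encoding of your string literals.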
5. From step "A2" to "A3", HTML documents are generated from PHP scripts mainly through the print() function.
The print() function copies every byte from the specified string to the HTML document unchanged. This guarantees
that non-ASCII characters in any encoding scheme will be copied correctly to the HTML document.
Again, this is different from JSP pages, where strings are converted into a byte stream based on a specified
encoding scheme, if you are using character-based output stream functions.
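The byte-for-byte behavior in rule 5 can be checked with an output buffer. This is only a sketch: it captures what print() writes and compares it with the original string. The GB2312 sample bytes and the variable names are my own.

```php
<?php
// Two Chinese characters encoded in GB2312, given as raw bytes.
$gb2312 = "\xD6\xD0\xCE\xC4";

// Capture what print() sends to the output.
ob_start();
print $gb2312;
$captured = ob_get_clean();

// The captured output holds exactly the same bytes as the string.
$identical = ($captured === $gb2312);
echo "printed bytes: ", bin2hex($captured), "\n";
echo "identical:     ", $identical ? "yes" : "no", "\n";
```

Since no output handler is installed on the buffer, PHP performs no conversion at all. The same holds for strings in UTF-8 or any other encoding scheme.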
6. If you do want to convert from one encoding scheme to another while print() output is being generated,
you can use mb_output_handler as the callback function on the output buffer: ob_start("mb_output_handler").
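To show the idea behind rule 6, the sketch below uses a hand-written callback instead of mb_output_handler, so that the source and target encodings are visible in the code. This is my own stand-in, not how mb_output_handler is implemented. It assumes the mbstring extension is enabled, the script outputs UTF-8, and the target is ISO-8859-1. The outer buffer is there only to capture the converted result for inspection.

```php
<?php
// Requires the mbstring extension.
// Hand-written callback standing in for mb_output_handler:
// converts buffered output from UTF-8 to ISO-8859-1.
$toLatin1 = function (string $buffer): string {
    return mb_convert_encoding($buffer, 'ISO-8859-1', 'UTF-8');
};

ob_start();          // outer buffer: only captures the result
ob_start($toLatin1); // inner buffer: converts when it is flushed

print "caf\xC3\xA9"; // accented e as 2 bytes in UTF-8

ob_end_flush();              // runs the callback, passes result outward
$converted = ob_get_clean(); // converted bytes from the outer buffer

echo "converted bytes: ", bin2hex($converted), "\n";
```

With mb_output_handler itself, there is no explicit pair of encoding names in the ob_start() call. It takes the source encoding from mb_internal_encoding() and the target encoding from mb_http_output().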