Chinese Web Sites Using PHP - v2.23, by Herong Yang
Root Cause of Corrupted Chinese Text
This section provides a tutorial example that demonstrates the root cause of corrupted Chinese text: the original bytes are decoded with the wrong encoding, usually an 8-bit one.
Based on my experience, the root cause of corrupted Chinese text falls into one of the following 3 encoding mistakes:
1. The original Chinese text is encoded in UTF-8, but it gets decoded as one of the 8-bit encodings.
2. The original Chinese text is encoded in Unicode (UTF-16BE), but it gets decoded as one of the 8-bit encodings.
3. The original Chinese text is encoded in GB18030 (GBK), but it gets decoded as one of the 8-bit encodings.
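To see why each of the three mistakes produces garbage, it helps to look at a single character. The following is a minimal sketch, not part of the original tutorial: the helper name decodeAsLatin1() is made up for illustration, and ISO-8859-1 stands in for "one of the 8-bit encodings". It takes one Chinese character (U+4E2D, the third character of the sample text used later in this section) in each original encoding and decodes its bytes the wrong way.

```php
<?php
# Hypothetical helper (not part of the original tutorial): decode raw bytes
# as ISO-8859-1, the wrong encoding, and return the result as UTF-8.
function decodeAsLatin1($bytes) {
   return iconv("ISO-8859-1", "UTF-8", $bytes);
}

# The same character, U+4E2D, in the three original encodings.
$samples = array(
   "UTF-8"    => hex2bin('e4b8ad'),
   "UTF-16BE" => hex2bin('4e2d'),
   "GB18030"  => hex2bin('d6d0'),
);

foreach ($samples as $encoding => $bytes) {
   $wrong = decodeAsLatin1($bytes);
   # An 8-bit decoder turns every byte into one character,
   # so 1 Chinese character becomes 2 or 3 unrelated characters.
   print($encoding.": ".bin2hex($bytes)." -> "
      .iconv_strlen($wrong, "UTF-8")." characters: ".$wrong."\n");
}
```

An 8-bit decoder treats every byte as a separate character, so the 3-byte UTF-8 form becomes 3 characters and the 2-byte UTF-16BE and GB18030 forms become 2 characters each. For example, the UTF-16BE bytes 4e2d come out as the two ASCII characters "N-".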
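There are many 8-bit encodings. The script below uses ISO-8859-1, CP437, and CP852 as examples, plus CP932, which is strictly a Japanese multi-byte encoding rather than an 8-bit one.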
With 3 possible original encodings and many possible wrong decodings, corrupted Chinese text can appear in a large number of variations. The following PHP script shows 12 possible corrupted outputs of a single Chinese text (3 original encodings, each decoded with 4 wrong encodings):
<?php
#- Chinese-Corrupted-Encoding.php
#- Copyright (c) 2005 HerongYang.com. All Rights Reserved.

# Original in Unicode (UTF-16BE)
$original = hex2bin('7b804f534e2d65877f519875');
corrupted($original, "UTF-16BE");

# Original in UTF-8
$original = hex2bin('e7ae80e4bd93e4b8ade69687e7bd91e9a1b5');
corrupted($original, "UTF-8");

# Original in GB18030 (GBK)
$original = hex2bin('bcf2cce5d6d0cec4cdf8d2b3');
corrupted($original, "GB18030");

function corrupted($original, $encoding) {
   print("\nOriginal encoding: ".$encoding."$\n");
   print(" Text: ".iconv($encoding, "UTF-8", $original)."$\n");
   print(" Binary: ".bin2hex($original)."$\n");
   print(" Corrupted as:\n");
   print(" ISO-8859-1: "
      .iconv("ISO-8859-1", "UTF-8//IGNORE", $original)."$\n");
   print(" CP437: ".iconv("CP437", "UTF-8//IGNORE", $original)."$\n");
   print(" CP852: ".iconv("CP852", "UTF-8//IGNORE", $original)."$\n");
   print(" CP932: ".iconv("CP932", "UTF-8//IGNORE", $original)."$\n");
}
?>
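If you run this test PHP script on a system that supports Chinese characters in UTF-8 encoding, you should see, for each of the 3 original encodings, the correctly decoded text, its bytes in hexadecimal, and 4 corrupted versions of it.
See the next tutorials for suggestions on how to avoid and recover from corrupted Chinese text.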
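As a small preview of why recovery is often possible, the sketch below (an illustration under stated assumptions, not the method of the later tutorials) shows that a wrong ISO-8859-1 decode changes only the interpretation of the bytes, not the bytes themselves. It assumes the wrong encoding is known to be ISO-8859-1, which maps every byte value to a character, so the mistake can be reversed exactly. Other wrong encodings may not round-trip this cleanly.

```php
<?php
# Original UTF-8 bytes of the sample Chinese text used in this section.
$original = hex2bin('e7ae80e4bd93e4b8ade69687e7bd91e9a1b5');

# The mistake: treat the UTF-8 bytes as ISO-8859-1 and re-encode as UTF-8.
$corrupted = iconv("ISO-8859-1", "UTF-8", $original);

# Undo the mistake: encode the corrupted text back to ISO-8859-1,
# which yields the original byte sequence again.
$restored = iconv("UTF-8", "ISO-8859-1", $corrupted);

print("Original bytes:  ".bin2hex($original)."\n");
print("Corrupted bytes: ".bin2hex($corrupted)."\n");
print("Restored bytes:  ".bin2hex($restored)."\n");
print("Round trip ok:   ".($restored === $original ? "yes" : "no")."\n");
```

The corrupted string is twice as long in bytes as the original, because each original byte is re-encoded as a separate 2-byte UTF-8 character, but no information is lost along the way.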