Root Cause of Corrupted Chinese Text

This section provides a tutorial example to demonstrate the root cause of corrupted Chinese text: the original Chinese text is decoded with an incorrect 8-bit encoding.

Based on my experience, corrupted Chinese text is usually caused by one of the following 3 encoding mistakes:

1. The original Chinese text is encoded in UTF-8, but it gets decoded as one of the 8-bit encodings.

2. The original Chinese text is encoded in Unicode (UTF-16BE), but it gets decoded as one of the 8-bit encodings.

3. The original Chinese text is encoded in GB18030 (GBK), but it gets decoded as one of the 8-bit encodings.
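To make the first mistake concrete, here is a minimal sketch that decodes the UTF-8 bytes of a single Chinese character as ISO-8859-1. The helper name decode_as_latin1() is hypothetical and is used only for this illustration; the sketch assumes the iconv extension is available.

```php
<?php
# Sketch: decode UTF-8 bytes as ISO-8859-1 and inspect the result.
# decode_as_latin1() is a hypothetical helper for this illustration.

function decode_as_latin1($bytes) {
  # Treat every input byte as one ISO-8859-1 character and
  # re-encode the result as UTF-8 for display
  return iconv("ISO-8859-1", "UTF-8", $bytes);
}

# UTF-8 bytes of the single character U+4E2D
$original = hex2bin('e4b8ad');
$corrupted = decode_as_latin1($original);

print("Original bytes:  ".bin2hex($original)."\n");
print("Corrupted bytes: ".bin2hex($corrupted)."\n");
print("Characters seen: ".iconv_strlen($corrupted, "UTF-8")."\n");
```

Because ISO-8859-1 assigns a character to every byte value, the 3 bytes of one Chinese character turn into 3 unrelated characters (U+00E4, U+00B8 and U+00AD) instead of causing an error. This is why the mistake often goes unnoticed until the text is displayed.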

There are many 8-bit encodings. Some examples are ISO-8859-1, CP437 and CP852.

With 3 possible encodings for the original text and many possible incorrect decodings, corrupted Chinese text can appear in a large number of variations. The following PHP script shows 12 possible corrupted outputs from a single Chinese text:

<?php 
#- Chinese-Corrupted-Encoding.php
#- Copyright (c) 2005 HerongYang.com. All Rights Reserved.

  # Original in Unicode (UTF-16BE)
  $original = hex2bin('7b804f534e2d65877f519875');
  corrupted($original, "UTF-16BE");
  
  # Original in UTF-8
  $original = hex2bin('e7ae80e4bd93e4b8ade69687e7bd91e9a1b5');
  corrupted($original, "UTF-8");
  
  # Original in GB18030 (GBK)
  $original = hex2bin('bcf2cce5d6d0cec4cdf8d2b3');
  corrupted($original, "GB18030");

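# Prints the original text and its bytes, then the same bytes
# decoded incorrectly with 4 other encodings
function corrupted($original, $encoding) {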
  print("\nOriginal encoding: ".$encoding."$\n");
  print("   Text: ".iconv($encoding, "UTF-8", $original)."$\n");
  print("   Binary: ".bin2hex($original)."$\n");
  print("   Corrupted as:\n");
  print("      ISO-8859-1: "
    .iconv("ISO-8859-1", "UTF-8//IGNORE", $original)."$\n");
  print("      CP437: ".iconv("CP437", "UTF-8//IGNORE", $original)."$\n");
  print("      CP852: ".iconv("CP852", "UTF-8//IGNORE", $original)."$\n");
  print("      CP932: ".iconv("CP932", "UTF-8//IGNORE", $original)."$\n");
}
?>
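
As a side check on the inputs used above, the following sketch confirms that the 3 hex strings represent the same 6-character text, and counts how many characters each one turns into when decoded as ISO-8859-1. The array and variable names are illustrative only, and the sketch assumes the local iconv library supports GB18030.

```php
<?php
# Sketch: compare the 3 byte strings used by Chinese-Corrupted-Encoding.php.
# The variable names here are illustrative only.

$inputs = array(
  "UTF-16BE" => hex2bin('7b804f534e2d65877f519875'),
  "UTF-8"    => hex2bin('e7ae80e4bd93e4b8ade69687e7bd91e9a1b5'),
  "GB18030"  => hex2bin('bcf2cce5d6d0cec4cdf8d2b3'),
);

$decoded = array();
$latin1_lengths = array();
foreach ($inputs as $encoding => $bytes) {
  # Correct decoding: convert from the true encoding to UTF-8
  $decoded[$encoding] = iconv($encoding, "UTF-8", $bytes);

  # Incorrect decoding: ISO-8859-1 maps every byte to one character
  $wrong = iconv("ISO-8859-1", "UTF-8", $bytes);
  $latin1_lengths[$encoding] = iconv_strlen($wrong, "UTF-8");

  print($encoding.": ".strlen($bytes)." bytes, "
    .$latin1_lengths[$encoding]." characters as ISO-8859-1\n");
}

# All 3 correct decodings should give the same UTF-8 text
$same = ($decoded["UTF-16BE"] === $decoded["UTF-8"])
  && ($decoded["GB18030"] === $decoded["UTF-8"]);
print("Same text in all 3 encodings: ".($same ? "yes" : "no")."\n");
```

The 6 Chinese characters take 12 bytes in UTF-16BE and GB18030, but 18 bytes in UTF-8. A single-byte decoder therefore produces 12 or 18 corrupted characters, depending on the original encoding.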

If you run Chinese-Corrupted-Encoding.php on a system that supports Chinese characters in UTF-8 encoding, you should see the following:

Corrupted Chinese Text Examples

See the next tutorials for suggestions on how to avoid and recover from corrupted Chinese text.
