Restore Corrupted Chinese Text

This section provides a tutorial example to restore corrupted Chinese text with different suggested solutions.

How to restore corrupted Chinese text? There is no single solution that can be applied to all situations of corrupted Chinese text. As you can see from previous tutorials, the original Chinese text could be stored in 3 or more encodings. The next system could have mistakenly decoded it with one of many possible 8-bit encodings. The situation could get even worse if the corrupted Chinese text gets decoded again by a third system with another 8-bit encoding.
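To make this failure mode concrete, here is a minimal Python sketch of how the corruption can build up. The sample string, the choice of GB18030 as the storage encoding, and the two wrong 8-bit encodings (MacRoman, then ISO-8859-1) are all assumptions picked for illustration, not a description of any particular system.

```python
# Hypothetical sample text; any Chinese string would do.
original = "中文"

# System 1 stores the text as GB18030 bytes (assumed storage encoding).
stored = original.encode("gb18030")

# System 2 wrongly decodes those bytes with an 8-bit encoding
# (MacRoman is assumed here) and saves the result as UTF-8.
corrupted_once = stored.decode("mac_roman")
saved_by_system_2 = corrupted_once.encode("utf-8")

# System 3 wrongly decodes the UTF-8 bytes with another 8-bit
# encoding (ISO-8859-1 is assumed), corrupting the text a second time.
corrupted_twice = saved_by_system_2.decode("latin-1")

print(repr(corrupted_once))
print(repr(corrupted_twice))
```

Each wrong decoding step in this sketch is reversible because both 8-bit codecs map every byte value to some character. Real-world cases may lose data if a system drops or replaces characters it cannot display.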

However, I have suggestions that could help you to restore corrupted Chinese text.

1. Back up a copy of the corrupted Chinese text by converting it into hexadecimal format, to protect it from further corruption.

2. Determine the current encoding of the corrupted Chinese text. This can be done by reviewing its binary data. As a starting point, assume that a large number of Chinese systems use GB18030 (GBK) encoding and that most other systems use UTF-8. Call the current encoding EncodingY.

3. Determine the encoding used by the last system to decode the original Chinese text. This is the most difficult step. You need to review the strange characters displayed from the corrupted Chinese text and try to map them to a guessed 8-bit encoding table. Call the guessed encoding, the one the last system wrongly used to decode the text, EncodingX.

4. Try to convert corrupted Chinese text to the original Chinese text by calling a conversion tool like "iconv" on a Linux system.

  iconv -f EncodingY -t EncodingX corrupted_text > recovered_text

5. Display "recovered_text" as UTF-8, UTF-16BE, or GB18030. If it shows as valid Chinese text, the job is done. Otherwise, go back to step 3.
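Steps 4 and 5 can also be sketched in Python for readers who prefer a script over "iconv". The function names restore() and try_display() are hypothetical helpers introduced for this illustration, and the demo input is a made-up sample rather than data from a real corrupted file.

```python
def restore(corrupted: bytes, encoding_y: str, encoding_x: str) -> bytes:
    """Undo one wrong decoding step.

    Roughly mirrors: iconv -f EncodingY -t EncodingX corrupted_text
    """
    text = corrupted.decode(encoding_y)   # read the file in its current encoding
    return text.encode(encoding_x)        # map junk characters back to raw bytes


def try_display(recovered: bytes,
                candidates=("utf-8", "utf-16-be", "gb18030")) -> dict:
    """Decode the recovered bytes with each candidate Chinese encoding.

    A value of None means the bytes are not valid in that encoding.
    """
    results = {}
    for enc in candidates:
        try:
            results[enc] = recovered.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Demo with a hypothetical sample: GB18030 text that was wrongly
# decoded as MacRoman and then saved as UTF-8.
sample = "中文".encode("gb18030").decode("mac_roman").encode("utf-8")
recovered = restore(sample, "utf-8", "mac_roman")
for enc, shown in try_display(recovered).items():
    print(enc, repr(shown))
```

A successful decode does not prove that the guess is right, because random bytes can still decode to some characters. You still need to read the output and judge whether it is meaningful Chinese text.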

Let's apply those suggestions to a corrupted file name that came out of a ZIP file on my macOS computer.

1. Back up a copy of the corrupted Chinese text in hexadecimal format using the "xxd corrupted.txt > corrupted.hex" command as shown in the picture below.

2. Determine the current encoding of the corrupted Chinese text by looking at the binary data. Since I am using a macOS computer, my first guess is UTF-8. To confirm this, I take the first 3 bytes from the binary data, 0xE28891, and search for them on the Internet. They form a valid 3-byte UTF-8 sequence. So EncodingY = UTF-8.
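Instead of searching the Internet, the same check can be done locally. The following Python sketch simply decodes the 3 bytes quoted above as UTF-8 and prints the resulting code point and its Unicode name. A failed decode would raise UnicodeDecodeError.

```python
import unicodedata

# First 3 bytes of the corrupted text, as quoted in step 2.
first_bytes = bytes.fromhex("e28891")

# decode() raises UnicodeDecodeError if this is not valid UTF-8.
char = first_bytes.decode("utf-8")

print(f"U+{ord(char):04X} {unicodedata.name(char)}")
# prints: U+2211 N-ARY SUMMATION
```

One valid sequence is only a hint. Decoding the whole file with the same call gives a stronger confirmation.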

3. Determine the encoding used by the last system to decode the original Chinese text, by looking at those junk characters displayed from the corrupted text. The ISO-8859-1 table has the Japanese Yen sign, but not the upper Sigma letter. The CP437 encoding table has the upper Sigma letter, but not the diamond sign. Since I am using a macOS computer, let's see the MAC encoding table by running "php 8-Bit-Encoding-Table.php MAC" as shown below. Luckily, the MAC encoding table has all those junk characters. So let's set EncodingX = MAC.
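The table comparison in step 3 can also be approximated with a short Python sketch, separate from the PHP script mentioned above. It asks each candidate 8-bit codec whether it can encode every junk character. The junk characters listed are assumptions based on the description: the Yen sign (U+00A5), the summation sign decoded in step 2 (U+2211), and the lozenge (U+25CA) for the "diamond sign". The helper name encodings_covering() is hypothetical.

```python
# Junk characters assumed from the description in step 3:
# Yen sign, summation sign, lozenge ("diamond sign").
junk = "\u00a5\u2211\u25ca"


def encodings_covering(text: str, candidates) -> list:
    """Return the candidate encodings that can represent every character."""
    covering = []
    for enc in candidates:
        try:
            text.encode(enc)
        except UnicodeEncodeError:
            continue  # at least one junk character is missing from this table
        covering.append(enc)
    return covering


# Python codec names: latin-1 = ISO-8859-1, mac_roman = iconv's "MAC".
matches = encodings_covering(junk, ["latin-1", "cp437", "mac_roman"])
print(matches)
```

Note that Python compares exact Unicode code points. CP437 contains the Greek capital sigma (U+03A3), which looks similar to the summation sign (U+2211) but is a different character, so this check is stricter than comparing glyphs by eye.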

4. Try to convert the corrupted Chinese text to the original Chinese text by calling the "iconv" command:

herong$ iconv -f UTF-8 -t MAC corrupted.txt > recovered.txt 

herong$ xxd recovered.txt 
00000000: b7d6 d7d3 b4fd d7a2 b2e1                 ..........

5. Display the recovered text using a Chinese encoding. Judging by the binary data of the recovered text, it seems to be in GB18030 encoding. Let's try to display it using the "iconv -f GB18030 -t UTF-8 recovered.txt | more" command.
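As a cross-check of step 5, here is a minimal Python sketch that decodes the 10 recovered bytes from the "xxd" output above as GB18030. It also re-applies the assumed corruption (decode as MacRoman, save as UTF-8) to see whether the result starts with the bytes 0xE28891 observed in step 2.

```python
# Recovered bytes copied from the xxd output in step 4.
recovered = bytes.fromhex("b7d6d7d3b4fdd7a2b2e1")

# Step 5: interpret the bytes as GB18030 text.
# decode() raises UnicodeDecodeError if the bytes are not valid GB18030.
text = recovered.decode("gb18030")
print(text)
print(len(text), "characters")

# Sanity check: re-create the corruption and compare with step 2.
recorrupted = recovered.decode("mac_roman").encode("utf-8")
print(recorrupted[:3].hex())
```

If the re-corrupted bytes match the original corrupted file, that supports the guess that EncodingX = MAC and that the original text was stored in GB18030.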