This section describes how to count multi-byte characters using the php_mbstring.dll module.
Once you have configured PHP to use the php_mbstring.dll module, you are ready to use multibyte string functions to manipulate
Chinese character strings as characters instead of bytes.
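Before running the example, it can help to confirm that the multibyte functions are actually available. The following is a minimal sketch, not part of the original tutorial script; the variable name $mbstring_ready is only illustrative.

```php
<?php
// Sketch: check that the mbstring extension is loaded before using mb_* functions.
$mbstring_ready = extension_loaded('mbstring');

if ($mbstring_ready) {
    // mb_internal_encoding() with no argument returns the encoding
    // that mb_* functions assume when none is passed explicitly.
    print('mbstring is loaded, internal encoding: ' . mb_internal_encoding() . "\n");
} else {
    print("mbstring is not loaded, check the extension settings in php.ini\n");
}
```

If the second message appears, mb_strlen() is not defined and the script below will fail with an undefined function error.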
Here is a simple example PHP script using mb_strlen() to count Chinese characters in a string:
<?php #Count-UTF-8.php
# Copyright (c) 2007 by Dr. Herong Yang, http://www.herongyang.com/
#
$help_simplified = '这是一份非常间单的说明书…';
$help_traditional = '這是一份非常間單的說明書…';
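# GB18030 and Big5 byte strings, converted from the UTF-8 strings above
$help_gb18030 = mb_convert_encoding($help_simplified, 'GB18030', 'UTF-8');
$help_big5 = mb_convert_encoding($help_traditional, 'BIG-5', 'UTF-8');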
print('<html>');
print('<meta http-equiv="Content-Type"'.
' content="text/html; charset=utf-8"/>');
print('<body>');
# Showing UTF-8 characters
print('<b>Chinese string in UTF-8 in PHP</b><br/>');
print('UTF-8 simplified characters: '.$help_simplified.'<br/>');
print('UTF-8 traditional characters: '.$help_traditional.'<br/>');
# Trying to show GB18030 characters
print('<b>GB18030 string included in a UTF-8 page</b><br/>');
print('GB18030 characters: '.$help_gb18030.'<br/>');
# Trying to show Big5 characters
print('<b>Big5 string included in a UTF-8 page</b><br/>');
print('Big5 characters: '.$help_big5.'<br/>');
# Counting UTF-8 characters
print('<b>Count UTF-8 characters in strings:</b><br/>');
print('UTF-8 simplified characters: '
.mb_strlen($help_simplified).'<br/>');
print('UTF-8 traditional characters: '
.mb_strlen($help_traditional).'<br/>');
print('GB18030 characters: '.mb_strlen($help_gb18030).'<br/>');
print('Big5 characters: '.mb_strlen($help_big5).'<br/>');
# Counting bytes
print('<b>Count bytes in strings:</b><br/>');
print('UTF-8 simplified characters: '
.strlen($help_simplified).'<br/>');
print('UTF-8 traditional characters: '
.strlen($help_traditional).'<br/>');
print('GB18030 characters: '.strlen($help_gb18030).'<br/>');
print('Big5 characters: '.strlen($help_big5).'<br/>');
print('</body>');
print('</html>');
?>
Here is the Web page generated from this PHP script:
If you look at the Web page carefully, you will see:
mb_strlen() counted Chinese characters correctly on UTF-8 encoded strings:
13 characters in both the simplified and the traditional character strings. The '…' at the end
of both strings is a single character, the ellipsis (U+2026).
mb_strlen() counted Chinese characters incorrectly on both GB18030 and Big5 encoded strings.
This is understandable, because mb_strlen() assumes UTF-8 encoding based on the PHP configuration
settings in php.ini. In order for mb_strlen() to work correctly on the GB18030 string, you need to change the setting to mbstring.internal_encoding = GB18030, or pass the encoding as the second argument of mb_strlen().
Comparing with the byte counts returned by strlen(), we know that 1 Chinese character is mapped to 3 bytes
in UTF-8 encoding: 13 characters vs. 39 bytes. This is only true in this test. Some Chinese characters
need to be mapped to 4 bytes in UTF-8 encoding.
Looking at the byte counts of the GB18030 and Big5 character strings, we know that 1 Chinese character is mapped
to 2 bytes in GB18030 and Big5 encodings. Again, this is only true in this test. Some GB18030 characters are mapped to 4 bytes.
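The observations above can be reproduced in a few lines. The following is a minimal sketch, assuming the file is saved as UTF-8 and the mbstring extension is loaded. Instead of relying on php.ini, it passes the encoding to mb_strlen() as the second argument, and it builds the GB18030 byte string with mb_convert_encoding(). The variable names and the extra sample character U+20000 are illustrative and not part of the original script.

```php
<?php
// Sketch: character counts vs. byte counts with explicit encodings.
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$utf8_text = '这是一份非常间单的说明书…';

// Build the GB18030 byte string from the UTF-8 one.
$gb18030_text = mb_convert_encoding($utf8_text, 'GB18030', 'UTF-8');

// Passing the encoding as the second argument overrides the configured default.
$utf8_chars    = mb_strlen($utf8_text, 'UTF-8');
$gb18030_chars = mb_strlen($gb18030_text, 'GB18030');

// strlen() always counts bytes, whatever the encoding.
$utf8_bytes    = strlen($utf8_text);
$gb18030_bytes = strlen($gb18030_text);

// U+20000 lies outside the Basic Multilingual Plane, so UTF-8 needs 4 bytes for it.
$rare_char       = "\xF0\xA0\x80\x80";
$rare_char_bytes = strlen($rare_char);
$rare_char_count = mb_strlen($rare_char, 'UTF-8');

print("UTF-8:   $utf8_chars characters, $utf8_bytes bytes\n");
print("GB18030: $gb18030_chars characters, $gb18030_bytes bytes\n");
print("U+20000: $rare_char_count character, $rare_char_bytes bytes in UTF-8\n");
```

With the sample string this should report 13 characters in both encodings, 39 bytes in UTF-8 and 26 bytes in GB18030, which matches the counts discussed above. Passing the encoding explicitly keeps the result independent of the server configuration.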