Decoding HTML Entities

This section provides a tutorial example on how to decode HTML entities received from HTML forms with iso-8859-1 encoding for non-ASCII characters.

As you saw earlier in this chapter, if a page has "charset=iso-8859-1", Unicode characters that cannot be represented in ISO-8859-1 are received as HTML entities in $_REQUEST. How can we convert them back to Unicode characters?

I tried "urldecode()" and "rawurldecode()". They only reverse percent-encoding, so they work fine on single-byte characters. But they do not work with multi-byte characters, which arrive as HTML entities and stay that way.
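To make the difference concrete, here is a minimal sketch that is not part of the original scripts. The sample values are assumptions chosen for illustration: the first is the percent-encoded form of the numeric entity for the Korean character U+C5EC, and the second is a percent-encoded ISO-8859-1 byte.

```php
<?php
# Hypothetical sample: the entity "&#50668;" as it could appear
# percent-encoded in a query string.
$raw = "%26%2350668%3B";

# urldecode() reverses percent-encoding only; the entity text remains.
$decoded = urldecode($raw);
print($decoded . "\n");

# rawurldecode() gives the same result for this input.
$rawDecoded = rawurldecode($raw);
print($rawDecoded . "\n");

# A single-byte ISO-8859-1 character comes back as one byte.
$single = urldecode("%A1Hola");
print(strlen($single) . "\n");   # 5 bytes: 0xA1 followed by "Hola"
```

In both calls the result is still the entity text, not the character it stands for, which is why a separate decoding step is needed.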

PHP has a special function "html_entity_decode()" to decode HTML entities with multi-byte characters. Here is the syntax of html_entity_decode():

   html_entity_decode(string[, quote_style[, charset]])

where "string" is the HTML entity encoded string; "quote_style" specifies how quotes should be handled; and "charset" specifies which character set to use. Supported character sets include: ISO-8859-1, UTF-8, cp1251, GB2312, and Shift_JIS.
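As a quick illustration of the three arguments, the following sketch decodes a short hand-written string and then shows how the quote_style constants ENT_COMPAT, ENT_QUOTES and ENT_NOQUOTES differ. The input strings are made up for this example, and the result is printed as a hex dump so that it does not depend on the encoding of your terminal.

```php
<?php
# A named entity followed by numeric entities for two Korean characters.
$input = "&iexcl;&#50668;&#48372;";

# Decode to UTF-8 bytes and show them in hex.
$utf8 = html_entity_decode($input, ENT_COMPAT, "UTF-8");
print(bin2hex($utf8) . "\n");

# quote_style controls which quote entities are decoded:
# ENT_COMPAT decodes double quotes only, ENT_QUOTES decodes both,
# ENT_NOQUOTES decodes neither.
$quoted   = "&quot;a&quot; &#039;b&#039;";
$compat   = html_entity_decode($quoted, ENT_COMPAT, "UTF-8");
$quotes   = html_entity_decode($quoted, ENT_QUOTES, "UTF-8");
$noquotes = html_entity_decode($quoted, ENT_NOQUOTES, "UTF-8");
print("$compat\n$quotes\n$noquotes\n");
```

ENT_COMPAT is the value used in the script in this section, since only the non-ASCII characters matter there.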

To show you how to use html_entity_decode(), I modified InputIsoGet.php to InputIsoGetDecoded.php:

<?php
#  InputIsoGetDecoded.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.
#
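#- Promoting CGI values to local variables
#  (import_request_variables() was removed in PHP 5.4,
#  so the values are read from $_REQUEST directly)
   $r_English = isset($_REQUEST['English']) ? $_REQUEST['English'] : '';
   $r_Spanish = isset($_REQUEST['Spanish']) ? $_REQUEST['Spanish'] : '';
   $r_Korean = isset($_REQUEST['Korean']) ? $_REQUEST['Korean'] : '';
   $r_ChineseUtf8 = isset($_REQUEST['ChineseUtf8'])
      ? $_REQUEST['ChineseUtf8'] : '';
   $r_ChineseGb2312 = isset($_REQUEST['ChineseGb2312'])
      ? $_REQUEST['ChineseGb2312'] : '';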

#- Generating HTML document
   print("<html>");
   print('<meta http-equiv="Content-Type"'
      .' content="text/html; charset=utf-8"/>');
   print("<body>\n");
   print("<form action=InputIsoGetDecoded.php method=get>");
   print("English ASCII: <input name=English"
      ." value='$r_English' size=16><br>\n");
   print("Spanish UTF-8: <input name=Spanish"
      ." value='$r_Spanish' size=16><br>\n");
   print("Korean UTF-8: <input name=Korean"
      ." value='$r_Korean' size=16><br>\n");
   print("Chinese UTF-8: <input name=ChineseUtf8"
      ." value='$r_ChineseUtf8' size=16><br>\n");
   print("Chinese GB2312: <input name=ChineseGb2312"
      ." value='$r_ChineseGb2312' size=16><br>\n");
   print("<input type=submit name=submit value=Submit>\n");
   print("</form>\n");

#- Outputting input strings back to HTML document
   print("<hr>");
   print("<pre>");
   print("Input strings before decoding:\n");
   foreach ($_GET as $k => $v) {
      print "$k = ($v)\n";
   }
   print("</pre>");

#- Outputting input strings back to HTML document - decoded
   print("<hr>");
   print("<pre>");
   print("Input strings after decoding:\n");
   foreach ($_GET as $k => $v) {
      print("$k = (".html_entity_decode($v,ENT_COMPAT,"UTF-8").")\n");
   }
   print("</pre>");
   print("</body>");
   print("</html>");

#- Dumping input strings to a file
   $file = fopen("\\temp\\InputIsoGet.txt", 'ab');
   $str = "------\n";
   fwrite($file, $str, strlen($str));
   if (array_key_exists('QUERY_STRING',$_SERVER)) {
      $str = $_SERVER['QUERY_STRING'];
   } else {
      $str = "";   # empty string, so strlen() and fwrite() get a string
   }
   fwrite($file, $str, strlen($str));

   $str = "------\n";
   fwrite($file, $str, strlen($str));
   foreach ($_REQUEST as $k => $v) {
      $str = "$k = ($v)\n";
      fwrite($file, $str, strlen($str));
   }
   fclose($file);
?>
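The decoding loop in the script can also be wrapped in a small function, which makes it easy to try without a browser. This is only a sketch: decode_entities_in() is a hypothetical helper name that does not appear in the original script, it assumes every value is a plain string (no nested arrays), and the sample array stands in for a real $_GET.

```php
<?php
# Hypothetical helper, not part of InputIsoGetDecoded.php.
# Assumes every value in $input is a string (no nested arrays).
function decode_entities_in(array $input) {
    $out = array();
    foreach ($input as $k => $v) {
        $out[$k] = html_entity_decode($v, ENT_COMPAT, "UTF-8");
    }
    return $out;
}

# Simulated request data instead of a real $_GET.
$sample = array(
    "English" => "Hello world!",
    "Korean"  => "&#50668;&#48372;&#49464;&#50836;",
);

$decodedSample = decode_entities_in($sample);
foreach ($decodedSample as $k => $v) {
    # Hex dump of the decoded bytes, independent of terminal encoding.
    print("$k = (" . bin2hex($v) . ")\n");
}
```

Plain ASCII values pass through unchanged, while each numeric entity becomes a multi-byte UTF-8 sequence.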

Now enter the following input strings on InputIsoGetDecoded.php to see what happens:

English ASCII: Hello world!
Spanish UTF-8: ¡Hola mundo!
Korean UTF-8: 여보세요 세계 !
Chinese UTF-8: 你好世界!
Chinese GB2312: ????¡

If you click the submit button, you will get:

Input strings before decoding:
English = (Hello world!)
Spanish = (¡Hola mundo!)
Korean = (???? ?? !)
ChineseUtf8 = (????!)
ChineseGb2312 = (????¡)
submit = (Submit)
------
Input strings after decoding:
English = (Hello world!)
Spanish = (¡Hola mundo!)
Korean = (여보세요 세계 !)
ChineseUtf8 = (你好世界!)
ChineseGb2312 = (????¡)
submit = (Submit)

The first section shows you input strings as they are received in HTML entity encoding. The second section shows you input strings as they are decoded from HTML entity encoding to UTF-8 encoding.
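The "charset" argument decides which bytes the decoded characters become. The sketch below, with entities chosen only for illustration, compares ISO-8859-1 and UTF-8 as the target character set. As far as I know, an entity whose character cannot be represented in the target character set is simply left undecoded, which is why "UTF-8" is the safe choice for the Korean and Chinese input in this section.

```php
<?php
$entity = "&iexcl;";   # named entity for U+00A1

# The same entity decoded into two different character sets.
$asLatin1 = html_entity_decode($entity, ENT_COMPAT, "ISO-8859-1");
$asUtf8   = html_entity_decode($entity, ENT_COMPAT, "UTF-8");
print(bin2hex($asLatin1) . "\n");   # a1   (one byte)
print(bin2hex($asUtf8) . "\n");     # c2a1 (two bytes)

# U+C5EC has no ISO-8859-1 representation, so the entity stays as text.
$korean = "&#50668;";
$leftAlone = html_entity_decode($korean, ENT_COMPAT, "ISO-8859-1");
print($leftAlone . "\n");
```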
