PHP Modules Tutorials - Herong's Tutorial Examples - v5.18, by Herong Yang
Load HTML Documents with LIBXML_NOBLANKS
This section provides a tutorial example on how to load HTML documents with the LIBXML_NOBLANKS option using the PHP DOM Extension. In the test below, the option only removes whitespace-only text nodes between tags outside the 'body' element.
In HTML documents, whitespace (any sequence of space, tab, carriage return and line feed characters) is largely ignored. To be precise, whitespace between words is treated as a single space character, and whitespace before and after element tags is ignored.
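The whitespace rule described above can be approximated in a few lines of PHP. This is only a rough sketch: the function name collapseWhitespace() is hypothetical, and real HTML rendering has exceptions (for example, 'pre' elements) that it does not handle.

```php
<?php
// Hypothetical helper: approximates how HTML treats whitespace in text.
function collapseWhitespace(string $text): string {
    // Replace each run of space, tab, carriage return and line feed
    // with a single space character.
    $collapsed = preg_replace('/[ \t\r\n]+/', ' ', $text);
    // Drop the whitespace that sits right after an opening tag
    // or right before a closing tag.
    return trim($collapsed);
}

print("(" . collapseWhitespace("\n  Hello \t\r\n World!\n") . ")\n");
```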
Whitespace is usually inserted into an HTML document to make it more readable. Here is an example of an HTML document formatted with whitespace, Hello-Formatted.html:
<html>
<head>
<title>
Hello
</title>
</head>
<body bgcolor="#ddddff">
<p>
Hello World!
</p>
</body>
</html>
To save storage space, you may want to remove whitespace, also called blanks, from a formatted HTML document.
According to the DOM Extension documentation, the LIBXML_NOBLANKS option on the loadHTML() method can be used to remove blank nodes.
Here is my test script, Load-HTML-with-NOBLANKS.php:
<?php
# Load-HTML-with-NOBLANKS.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

   # Read the input file name and the loading option from the command line
   $input = $argv[1];
   $option = $argv[2] ?? "default";
   $html = file_get_contents($input);

   # Load the HTML document with or without LIBXML_NOBLANKS
   $doc = new DOMDocument();
   if ($option=="noblanks") {
      $doc->loadHTML($html, LIBXML_NOBLANKS);
   } else {
      $doc->loadHTML($html);
   }

   print("\n------ HTML Document ------\n");
   print($doc->saveHTML());

   print("\n------ DOMNode Tree ------\n");
   printNode($doc, "");

# Print a node as "name= attributes, children, type", then recurse.
# A count of -1 means the node has no attribute list or no child list.
function printNode($node, $prefix) {
   $attriLen = -1;
   $attriList = $node->attributes;
   if (isset($attriList)) $attriLen = $attriList->length;

   $childLen = -1;
   $childList = $node->childNodes;
   if (isset($childList)) $childLen = $childList->length;

   print($prefix . $node->nodeName . "= "
      . $attriLen . ", " . $childLen . ", " . $node->nodeType . " ");

   if ($node->nodeType==XML_TEXT_NODE) {
      print("text: (" . $node->nodeValue . ")\n");
   } else if ($node->nodeType==XML_ATTRIBUTE_NODE) {
      print("attribute: (" . $node->nodeValue . ")\n");
   } else {
      print("other: (...)\n");
   }

   # Attributes are printed with an "@" marker, children with extra indentation
   if (isset($attriList)) {
      foreach ($attriList as $n) printNode($n, $prefix." @");
   }
   if (isset($childList)) {
      foreach ($childList as $n) printNode($n, $prefix." ");
   }
}
?>
Test 1 - Run the script without LIBXML_NOBLANKS to see all blank nodes:
herong> php Load-HTML-with-NOBLANKS.php Hello-Formatted.html default
------ HTML Document ------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html>
<head>
<title>
Hello
</title>
</head>
<body bgcolor="#ddddff">
<p>
Hello World!
</p>
</body>
</html>
------ DOMNode Tree ------
#document= -1, 2, 13 other: (...)
 html= -1, -1, 10 other: (...)
 html= 0, 5, 1 other: (...)
  #text= -1, -1, 3 text: (
)
  head= 0, 3, 1 other: (...)
   #text= -1, -1, 3 text: (
)
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
Hello
)
   #text= -1, -1, 3 text: (
)
  #text= -1, -1, 3 text: (
)
  body= 1, 3, 1 other: (...)
   @bgcolor= -1, 1, 2 attribute: (#ddddff)
   @ #text= -1, -1, 3 text: (#ddddff)
   #text= -1, -1, 3 text: (
)
   p= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
Hello World!
)
   #text= -1, -1, 3 text: (
)
  #text= -1, -1, 3 text: (
)
As you see from the output, the default behavior of loadHTML() preserves all whitespaces as "#text" nodes.
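Counting the blank "#text" nodes by eye gets tedious on larger documents. As a minimal sketch, the count can be taken with an XPath query instead. The function name countBlankTextNodes() and the inline sample string are assumptions made for illustration; they are not part of the test script above.

```php
<?php
// Hypothetical helper: count text nodes that contain only whitespace.
function countBlankTextNodes(DOMDocument $doc): int {
    $xpath = new DOMXPath($doc);
    // normalize-space() returns an empty string for a text node
    // made only of space, tab, carriage return and line feed.
    $blanks = $xpath->query("//text()[normalize-space(.) = '']");
    return $blanks->length;
}

// Small inline sample with one line feed between two paragraphs.
$html = "<html><body><p>Hello</p>\n<p>World!</p></body></html>";
$doc = new DOMDocument();
$doc->loadHTML($html);
print(countBlankTextNodes($doc) . " blank text node(s)\n");
```

Running the same function on documents loaded with and without LIBXML_NOBLANKS gives a quick way to compare how many blank nodes each loading mode keeps.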
Test 2 - Run the script with LIBXML_NOBLANKS to see which blank nodes are removed:
herong> php Load-HTML-with-NOBLANKS.php Hello-Formatted.html noblanks
------ HTML Document ------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><head><title>
Hello
</title></head><body bgcolor="#ddddff">
<p>
Hello World!
</p>
</body></html>
------ DOMNode Tree ------
#document= -1, 2, 13 other: (...)
 html= -1, -1, 10 other: (...)
 html= 0, 2, 1 other: (...)
  head= 0, 1, 1 other: (...)
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
Hello
)
  body= 1, 3, 1 other: (...)
   @bgcolor= -1, 1, 2 attribute: (#ddddff)
   @ #text= -1, -1, 3 text: (#ddddff)
   #text= -1, -1, 3 text: (
)
   p= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
Hello World!
)
   #text= -1, -1, 3 text: (
)
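The output shows some interesting surprises to me. The blank "#text" nodes directly under <html> and <head> are removed, but the blank "#text" nodes inside <body> are preserved. The whitespace around "Hello" in <title> and "Hello World!" in <p> is also preserved, because those text nodes are not blank.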
I think that the LIBXML_NOBLANKS option is trying to preserve whitespace inside <body>, because it can be meaningful in HTML documents. This is different from how the option behaves on XML documents.
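To see the contrast with XML, the same option can be passed to loadXML(). The following is a minimal sketch using a made-up XML string; it only compares the number of child nodes under the root element with and without LIBXML_NOBLANKS.

```php
<?php
// Made-up sample: two elements separated by line feeds.
$xml = "<root>\n<a>Hello</a>\n<b>World</b>\n</root>";

// Default loading keeps the line feeds as blank text nodes.
$plain = new DOMDocument();
$plain->loadXML($xml);

// Loading with LIBXML_NOBLANKS asks the parser to drop blank nodes.
$stripped = new DOMDocument();
$stripped->loadXML($xml, LIBXML_NOBLANKS);

print("default:  " . $plain->documentElement->childNodes->length
   . " child nodes\n");
print("noblanks: " . $stripped->documentElement->childNodes->length
   . " child nodes\n");
```

If the second count is smaller than the first, the blank nodes between the elements were dropped, which is the behavior the HTML test above did not show inside <body>.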