PHP Modules Tutorials - Herong's Tutorial Examples - v5.18, by Herong Yang
Remove Dummy Elements in HTML Documents
This section provides a tutorial example on how to remove dummy elements in HTML documents, including empty elements, elements with whitespaces only, and redundant whitespaces.
In some HTML documents, especially those generated by HTML editors, you may see many dummy (unnecessary or redundant) elements, as listed below:
1. Sequences of multiple whitespace characters generated by mistake. For example, <p> Hello world! </p>. We can replace each sequence with a single space. But we need to keep them inside <pre>.
2. Empty inline elements generated by mistake. For example, <span></span>. We can remove them. But we need to keep <img>, <br>, <td>, and <meta>, which can be empty legitimately.
3. Inline elements with whitespace only, generated by mistake. For example, <span> </span>. We can replace them with a single space. But we need to keep them inside <pre>.
4. Empty block elements generated by mistake or used for graphical presentation, like vertical spacing, backgrounds, or borders. For example, <div></div>. We can remove them, if we don't care about that graphical presentation.
5. Block elements with whitespace only, generated by mistake or used for graphical presentation, like vertical spacing, backgrounds, or borders. For example, <div> </div>. We can remove them, if we don't care about that graphical presentation.
6. Other special elements used for vertical spacing, like <div><br></div>.
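The element categories in items 2 through 6 can be told apart with a few DOM checks. Here is a minimal sketch, assuming a hypothetical helper named classifyElement() that is not part of the script in this section. It builds its test nodes with createElement(), so the result does not depend on how the HTML parser treats whitespace:

```php
<?php
# Sketch only: classifyElement() is a hypothetical helper for illustration.
function classifyElement(DOMNode $n) {
   # No children at all, like <span></span> or <div></div>
   if (!$n->hasChildNodes()) return "empty";

   if ($n->childNodes->length==1) {
      $c = $n->childNodes->item(0);
      # A single text child with whitespace only, like <span> </span>
      if ($c->nodeType==XML_TEXT_NODE && preg_match('/^\s+$/', $c->nodeValue))
         return "whitespace-only";
      # A single <br> child, like <div><br></div>
      if ($c->nodeName=="br") return "br-only";
   }
   return "has-content";
}

$doc = new DOMDocument();
$empty = $doc->createElement("span");
$blank = $doc->createElement("span");
$blank->appendChild($doc->createTextNode(" "));
$spacer = $doc->createElement("div");
$spacer->appendChild($doc->createElement("br"));
$para = $doc->createElement("p");
$para->appendChild($doc->createTextNode("Hello world!"));

foreach (array($empty, $blank, $spacer, $para) as $n)
   print($n->nodeName.": ".classifyElement($n)."\n");
```

The sketch does not distinguish inline elements from block elements. It only reports which of the "empty", "whitespace only", and "single <br>" patterns an element matches.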
Here is my PHP script, Remove-Dummy-Elements-in-HTML.php, that removes dummy elements in an HTML document. It also collapses whitespaces in all elements other than <pre>.
<?php
# Remove-Dummy-Elements-in-HTML.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

   $input = $argv[1];
   $html = file_get_contents($input);
   $doc = new DOMDocument();
   $doc->loadHTML($html);

   $hasDummies = true;
   $i = 0;
   while ($hasDummies) {
      $i++;
      print("Removing dummies and whitespaces: $i\n");
      collapseSpaces($doc);
      $hasDummies = removeDummies($doc);

      # After removing "empty", text-empty-text becomes text-text
      # We need to combine those text neighbors.
      $html = $doc->saveHTML();
      $doc->loadHTML($html);
   }
   print($doc->saveHTML());

function collapseSpaces($node) {
   $oldList = array();
   $newList = array();
   $childList = $node->childNodes;
   if (isset($childList)) foreach ($childList as $n) {
      $t = $n->nodeName;
      if ($n->nodeType==XML_TEXT_NODE) {
         $v = $n->nodeValue;
         $v = preg_replace('/\s{2,}/', " ", $v);
         $nn = $node->ownerDocument->createTextNode($v);
         array_push($oldList, $n);
         array_push($newList, $nn);
      } else if ( !($t=="pre") && $n->hasChildNodes() ) {
         collapseSpaces($n);
      }
   }
   for ($i=0; $i<count($oldList); $i++)
      $node->replaceChild($newList[$i], $oldList[$i]);
}

function removeDummies($node) {
   $hasDummies = false;
   $junks = array();
   $oldList = array();
   $newList = array();
   $childList = $node->childNodes;
   if (isset($childList)) foreach ($childList as $n) {
      $t = $n->nodeName;

      # skip some special elements
      if ( $n->nodeType==XML_DOCUMENT_TYPE_NODE
         || $n->nodeType==XML_DOCUMENT_NODE) continue;
      if ( $t=="#text" || $t=="img" || $t=="br" || $t=="td"
         || $t=="meta" ) continue;

      if ($n->hasChildNodes()) {
         $childList = $n->childNodes;
         if ($childList->length==1) {
            # $n has a single child
            $i = $childList->item(0);
            $it = $i->nodeType;
            $iv = $i->nodeValue;
            if ( $it==XML_TEXT_NODE && preg_match('/^\s+$/', $iv) ) {
               # $n has a single text child with whitespaces only
               $nn = $node->ownerDocument->createTextNode(" ");
               array_push($oldList, $n);
               array_push($newList, $nn);
               $hasDummies = true;
            } else if ($i->nodeName=="br") {
               # $n has a single "br" child
               array_push($junks, $n);
               $hasDummies = true;
            } else {
               $h = removeDummies($n);
               $hasDummies = $hasDummies || $h;
            }
         } else {
            # $n has multiple children
            $h = removeDummies($n);
            $hasDummies = $hasDummies || $h;
         }
      } else {
         # remove empty elements
         array_push($junks, $n);
         $hasDummies = true;
      }
   }

   foreach($junks as $n)
      $n->parentNode->removeChild($n);
   for ($i=0; $i<count($oldList); $i++)
      $node->replaceChild($newList[$i], $oldList[$i]);
   return $hasDummies;
}
?>
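The save-and-reload step in the main loop is there to merge neighboring text nodes. As a possible alternative, DOMNode::normalize() also merges adjacent text nodes in place. The sketch below only illustrates that method on a hand-built <p> element. It is not what the script does, and whether it can fully replace the reload step is not tested here:

```php
<?php
# Sketch only: merge text neighbors with DOMNode::normalize()
# instead of saveHTML() followed by loadHTML().
$doc = new DOMDocument();
$p = $doc->appendChild($doc->createElement("p"));
$p->appendChild($doc->createTextNode("with "));
$span = $p->appendChild($doc->createElement("span"));
$p->appendChild($doc->createTextNode(" Dummies"));

$p->removeChild($span);            # text-empty-text becomes text-text
$before = $p->childNodes->length;  # count the separate text nodes
$p->normalize();                   # merge adjacent text nodes
$after = $p->childNodes->length;   # count again after merging

print("Child nodes before: $before, after: $after\n");
print("Merged text: [".$p->textContent."]\n");
```

The merged text still contains two spaces in a row, so a whitespace-collapsing pass like collapseSpaces() would still be needed after the merge.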
Notes about my script:
Here is an HTML document, With-Dummies.html, with some dummy elements and redundant whitespaces:
<!-- With-Dummies.html
 - Copyright (c) 2009 HerongYang.com. All Rights Reserved.
-->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title></title>
</head>
<body bgcolor="#ddddff">
<p>Example<br>with<span> </span>Dummies</p>
<div>
<p> </p>
</div>
<pre> Copyright c HerongYang.com </pre>
<p> Copyright c HerongYang.com </p>
</body>
</html>
Here is the output after its dummy elements are removed:
herong> php Remove-Dummy-Elements-in-HTML.php With-Dummies.html

Removing dummies and whitespaces: 1
Removing dummies and whitespaces: 2
Removing dummies and whitespaces: 3
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\
</head>
<body bgcolor="#ddddff">
<p>Example<br>with Dummies</p>
<pre> Copyright c HerongYang.com </pre>
<p> Copyright c HerongYang.com </p>
</body>
</html>
The output tells me that:
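I think my script is pretty safe to use on any HTML document that has no input elements. You need to revise it, if you want to use it on HTML documents with input elements, because an element like <input> has no child nodes and would be removed as an empty element.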
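One way to revise the script for such documents is to extend the list of element names that removeDummies() skips. Here is a minimal sketch, assuming a hypothetical helper named isKeptElement(). The first five names come from the script; "input", "textarea", and "select" are assumed additions that you should adjust to your own documents:

```php
<?php
# Sketch only: isKeptElement() is a hypothetical replacement for the
# name test in removeDummies(). The form-related names are assumptions.
function isKeptElement(DOMNode $n) {
   static $keep = array("#text", "img", "br", "td", "meta",
      "input", "textarea", "select");
   return in_array($n->nodeName, $keep, true);
}

$doc = new DOMDocument();
$input = $doc->createElement("input");
$span = $doc->createElement("span");
print("input kept: ".(isKeptElement($input) ? "yes" : "no")."\n");
print("span kept: ".(isKeptElement($span) ? "yes" : "no")."\n");
```

In removeDummies(), the test on $t could then be replaced with: if (isKeptElement($n)) continue;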