PHP Modules Tutorials - Herong's Tutorial Examples - v5.18, by Herong Yang
Remove Whitespaces in HTML Documents
This section provides a tutorial example on how to remove whitespaces in HTML documents with the PHP DOM extension.
If you write your HTML documents following XML syntax rules, you can write a PHP script to remove whitespaces in both <head> and <body> elements.
Here is my version, Remove-Whitespaces-in-HTML.php:
<?php
# Remove-Whitespaces-in-HTML.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

   $input = $argv[1];
   $html = file_get_contents($input);
   $doc = new DOMDocument();
   $doc->loadHTML($html);
   removeBlanks($doc);
   print($doc->saveHTML());

function removeBlanks($node) {
   $junks = array();
   $oldList = array();
   $newList = array();
   $childList = $node->childNodes;
   if (isset($childList)) foreach ($childList as $n) {
      if ($n->nodeType==XML_TEXT_NODE) {
         $t = $n->nodeValue;
         $t = trim($t);
         if (strlen($t)==0) {
            # not safe to touch $node while looping its $childList
            # $node->removeChild($n);
            array_push($junks, $n);
         } else {
            $nn = $node->ownerDocument->createTextNode($t);
            # not safe to touch $node while looping its $childList
            # $node->replaceChild($nn, $n);
            array_push($oldList, $n);
            array_push($newList, $nn);
         }
      } else {
         removeBlanks($n);
      }
   }

   foreach($junks as $n) $node->removeChild($n);
   for ($i=0; $i<count($oldList); $i++)
      $node->replaceChild($newList[$i], $oldList[$i]);
}
?>
Note that it is not safe to modify a DOMNode object while looping over its child list. The child list is live: it is rebuilt whenever its parent node is modified, so a loop that adds or removes children as it goes can skip nodes.
This is why we should not call removeChild() and replaceChild() in the first loop. Instead, we record the pending changes in arrays and apply them after the first loop has finished.
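As an alternative to the cached arrays used in the script above, the live child list can be copied into a plain PHP array before the loop starts. The following is only a minimal sketch of that idea, not the approach taken in Remove-Whitespaces-in-HTML.php; the sample markup and the variable names are made up for illustration, and loadXML() is used so that the result does not depend on how the HTML parser treats blank text.

```php
<?php
// Illustrative input only: a list with blank text nodes between its items.
$doc = new DOMDocument();
$doc->loadXML('<ul> <li>a</li> <li>b</li> </ul>');
$list = $doc->documentElement;

// iterator_to_array() copies the live DOMNodeList into a static array,
// so removing children inside the loop cannot disturb the iteration.
foreach (iterator_to_array($list->childNodes) as $child) {
    if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) === '') {
        $list->removeChild($child);
    }
}

// Serialize only the <ul> element, without the XML declaration.
$result = $doc->saveXML($list);
print($result . "\n");   // prints <ul><li>a</li><li>b</li></ul>
```

Both approaches rest on the same rule: decide what to change while reading a stable list, and never iterate over the live list while the parent is being modified.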
Try it out on the Hello-Formatted.html file:
herong> type Hello-Formatted.html
<html>
 <head>
  <title>
   Hello
  </title>
 </head>
 <body bgcolor="#ddddff">
  <p>
   Hello World!
  </p>
 </body>
</html>

herong> php Remove-Whitespaces-in-HTML.php Hello-Formatted.html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><head><title>Hello</title></head><body bgcolor="#ddddff">\
<p>Hello World!</p></body></html>
Cool. All whitespaces between tags and around text are removed from the HTML document!
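Removing all whitespaces in an HTML document is fine if it is perfectly written in XML style and has no <pre> elements.
But in most HTML documents, whitespaces do have meanings. For example, the space between two inline elements is what separates their words when the page is rendered.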
So, do not use this script on any real HTML documents.
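To make the risk concrete, here is a small sketch showing how a blank text node between two inline elements carries meaning. The helper stripBlankText() is a hypothetical, condensed stand-in for removeBlanks(), written only for this demonstration; the sample markup is also made up, and loadXML() is used to keep the parsing predictable.

```php
<?php
// Hypothetical helper: a condensed stand-in for removeBlanks(), demo only.
function stripBlankText(DOMNode $node): void
{
    // Work on a static copy of the live child list.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $trimmed = trim($child->nodeValue);
            if ($trimmed === '') {
                $node->removeChild($child);      // drop blank-only text
            } else {
                $child->nodeValue = $trimmed;    // trim surrounding blanks
            }
        } elseif ($child->hasChildNodes()) {
            stripBlankText($child);              // recurse into child elements
        }
    }
}

$doc = new DOMDocument();
$doc->loadXML('<p><b>Hello</b> <i>World</i></p>');

$before = $doc->documentElement->textContent;
stripBlankText($doc->documentElement);
$after = $doc->documentElement->textContent;

print("before: [$before]\n");   // prints before: [Hello World]
print("after:  [$after]\n");    // prints after:  [HelloWorld]
```

The single space between </b> and <i> is a blank text node, so it is removed and the two words run together in the rendered page.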