Remove Whitespaces in HTML Documents

This section provides a tutorial example on how to remove whitespace from HTML documents with the PHP DOM extension.

If you write your HTML documents following XML syntax rules, you can write a PHP script to remove whitespaces in both <head> and <body> elements.

Here is my version, Remove-Whitespaces-in-HTML.php:

<?php
#  Remove-Whitespaces-in-HTML.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

  $input = $argv[1];
  $html = file_get_contents($input);
  $doc = new DOMDocument();

  $doc->loadHTML($html);
  removeBlanks($doc);
  print($doc->saveHTML());

function removeBlanks($node) {
  $junks = array();
  $oldList = array();
  $newList = array();

  $childList = $node->childNodes;
  if (isset($childList)) foreach ($childList as $n) {
    if ($n->nodeType==XML_TEXT_NODE) {
      $t = $n->nodeValue;
      $t = trim($t);
      if (strlen($t)==0) {

        # not safe to touch $node while looping its $childList
        # $node->removeChild($n);
        array_push($junks, $n);
      } else {
        $nn = $node->ownerDocument->createTextNode($t);

        # not safe to touch $node while looping its $childList
        # $node->replaceChild($nn, $n);
        array_push($oldList, $n);
        array_push($newList, $nn);
      }
    } else {
      removeBlanks($n);
    }
  }

  foreach($junks as $n) $node->removeChild($n);

  for ($i=0; $i<count($oldList); $i++)
    $node->replaceChild($newList[$i], $oldList[$i]);
}
?>

Note that it is not safe to modify a DOMNode object while looping over its child list. The childNodes property is a live list: it reflects every change made to the parent node immediately, so removing or replacing a child in the middle of a foreach loop can cause other children to be skipped.

This is why we should not call removeChild() and replaceChild() in the first loop. Instead, we record the pending changes in arrays and apply them after the first loop is completed.
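An alternative to keeping three separate arrays is to copy the live child list into a plain array first and loop over the copy. The following is only a minimal sketch of that idea, not the script above: stripBlankChildren() is a hypothetical helper name, and it uses loadXML() on a small inline sample instead of loadHTML() on a file to keep the example self-contained.

```php
<?php
// Sketch: snapshot the live childNodes list with iterator_to_array(),
// then modify the parent while iterating over the snapshot.
// stripBlankChildren() is a hypothetical name, not part of the DOM extension.
function stripBlankChildren(DOMNode $node): void {
  foreach (iterator_to_array($node->childNodes) as $child) {
    if ($child->nodeType == XML_TEXT_NODE) {
      $text = trim($child->nodeValue);
      if ($text === '') {
        # safe here, because the loop runs over the copied array
        $node->removeChild($child);
      } else {
        # trim the text in place instead of calling replaceChild()
        $child->nodeValue = $text;
      }
    } elseif ($child->hasChildNodes()) {
      stripBlankChildren($child);
    }
  }
}

$doc = new DOMDocument();
$doc->loadXML("<p>\n  <b> Hello </b>\n  <i>World</i>\n</p>");
stripBlankChildren($doc->documentElement);
print($doc->saveXML($doc->documentElement) . "\n");
# prints: <p><b>Hello</b><i>World</i></p>
```

Because the loop never walks the live list, nodes can be removed from the parent as soon as they are found.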

Try it out on the Hello-Formatted.html file:

herong> type Hello-Formatted.html

<html>
  <head>
    <title>
      Hello
    </title>
  </head>
  <body bgcolor="#ddddff">
    <p>
      Hello World!
    </p>
  </body>
</html>

herong> php Remove-Whitespaces-in-HTML.php Hello-Formatted.html

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><head><title>Hello</title></head><body bgcolor="#ddddff">\
<p>Hello World!</p></body></html>

Cool. All whitespace-only text nodes are removed from the HTML document, and the remaining text is trimmed!

Removing all whitespace in an HTML document is fine if it's perfectly written in XML style and has no <pre> elements.

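But in most HTML documents, whitespace does carry meaning. For example, the space between two inline elements, as in <b>Hello</b> <i>World</i>, separates the two words when the page is rendered, and whitespace inside a <pre> element is displayed exactly as written.

So, do not use this script on any real HTML documents.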
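To see why whitespace can matter, here is a small sketch that removes the whitespace-only text node between two inline elements and compares the text before and after. The sample markup is an assumption made up for this illustration, and loadXML() is used so that no HTML file is needed.

```php
<?php
// Sketch: dropping a whitespace-only text node between inline elements
// glues two words together. The sample markup is made up for illustration.
$doc = new DOMDocument();
$doc->loadXML('<p><b>Hello</b> <i>World</i></p>');
$p = $doc->documentElement;

$before = $p->textContent;
print($before . "\n");
# prints: Hello World

# loop over a copy of the child list, since $p is modified inside the loop
foreach (iterator_to_array($p->childNodes) as $child) {
  if ($child->nodeType == XML_TEXT_NODE && trim($child->nodeValue) === '') {
    $p->removeChild($child);
  }
}

$after = $p->textContent;
print($after . "\n");
# prints: HelloWorld
```

A browser would render the second version as a single word, which is exactly the kind of damage the script can do to a real page.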
