Remove Dummy Elements in HTML Documents

This section provides a tutorial example on how to remove dummy elements from HTML documents, including empty elements, whitespace-only elements, and redundant whitespace.

In some HTML documents, especially those generated by HTML editors, you may see many dummy (unnecessary or redundant) elements, as listed below:

1. Sequences of multiple whitespace characters generated by mistake. For example, <p>  Hello   world!  </p>. We can replace each sequence with a single space. But we need to keep them inside <pre>.

2. Empty inline elements generated by mistake. For example, <span></span>. We can remove them. But we need to keep elements that are legitimately empty, like <img>, <br>, <td>, and <meta>.

3. Inline elements with whitespace only, generated by mistake. For example, <span> </span>. We can replace them with a single space. But we need to keep them in <pre>.

4. Empty block elements generated by mistake or used for graphical presentation like vertical spacing, backgrounds, or borders. For example, <div></div>. We can remove them if we don't care about the graphical presentation.

5. Block elements with whitespace only, generated by mistake or used for graphical presentation like vertical spacing, backgrounds, or borders. For example, <div> </div>. We can remove them if we don't care about the graphical presentation.

6. Other special elements used for vertical spacing, like <div><br></div>. We can remove them too.
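
The cases listed above reduce to a few structural checks on an element's children. The following is a minimal sketch of those checks, not part of the script itself. classifyElement() is a hypothetical helper name, and the sample elements are built by hand with createElement() so the result does not depend on how the HTML parser treats whitespace.

```php
<?php
# Hypothetical helper: report which kind of dummy an element is, if any
function classifyElement(DOMElement $e): string {
  if (!$e->hasChildNodes()) {
    return "empty";
  }
  if ($e->childNodes->length == 1) {
    $child = $e->childNodes->item(0);
    if ($child->nodeType == XML_TEXT_NODE
      && preg_match('/^\s+$/', $child->nodeValue)) {
      return "whitespace-only";
    }
    if ($child->nodeName == "br") {
      return "br-only";
    }
  }
  return "content";
}

$doc = new DOMDocument();

$empty = $doc->createElement("span");              # <span></span>

$blank = $doc->createElement("span");              # <span> </span>
$blank->appendChild($doc->createTextNode("  \n "));

$spacer = $doc->createElement("div");              # <div><br></div>
$spacer->appendChild($doc->createElement("br"));

$real = $doc->createElement("p");                  # <p>Hello world!</p>
$real->appendChild($doc->createTextNode("Hello world!"));

foreach (array($empty, $blank, $spacer, $real) as $e) {
  print($e->nodeName . ": " . classifyElement($e) . "\n");
}
```

Whether an "empty" or "whitespace-only" element is really a dummy still depends on its name and context, for example <img> or anything inside <pre>.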

Here is my PHP script, Remove-Dummy-Elements-in-HTML.php, that removes dummy elements from an HTML document. It also collapses whitespace everywhere except inside <pre> elements.

<?php
#  Remove-Dummy-Elements-in-HTML.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

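  # The input HTML file name is the first command line argument
  if ($argc < 2) die("Usage: php Remove-Dummy-Elements-in-HTML.php input.html\n");
  $input = $argv[1];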
  $html = file_get_contents($input);
  $doc = new DOMDocument();

  $doc->loadHTML($html);

  $hasDummies = true;
  $i = 0;
  while ($hasDummies) {
    $i++;
    print("Removing dummies and whitespaces: $i\n");

    collapseSpaces($doc);
    $hasDummies = removeDummies($doc);

    # After an empty element is removed, its two text neighbors
    # (text-empty-text) become adjacent text nodes (text-text).
    # Saving and reloading the document combines those neighbors.
    $html = $doc->saveHTML();
    $doc->loadHTML($html);

  }

  print($doc->saveHTML());

function collapseSpaces($node) {
  $oldList = array();
  $newList = array();

  $childList = $node->childNodes;
  if (isset($childList)) foreach ($childList as $n) {
    $t = $n->nodeName;
    if ($n->nodeType==XML_TEXT_NODE) {
      $v = $n->nodeValue;
      $v = preg_replace('/\s{2,}/', " ", $v);
      $nn = $node->ownerDocument->createTextNode($v);
      array_push($oldList, $n);
      array_push($newList, $nn);
    } else if ( !($t=="pre") && $n->hasChildNodes() ) {
      collapseSpaces($n);
    }
  }

  for ($i=0; $i<count($oldList); $i++)
    $node->replaceChild($newList[$i], $oldList[$i]);
}

function removeDummies($node) {
  $hasDummies = false;

  $junks = array();
  $oldList = array();
  $newList = array();

  $childList = $node->childNodes;
  if (isset($childList)) foreach ($childList as $n) {
    $t = $n->nodeName;

    # skip some special elements
    if ( $n->nodeType==XML_DOCUMENT_TYPE_NODE
      || $n->nodeType==XML_DOCUMENT_NODE) continue;
    if ( $t=="#text" || $t=="img" || $t=="br" || $t=="td"
      || $t=="meta" ) continue;

    if ($n->hasChildNodes()) {

      $grandChildren = $n->childNodes;
      if ($grandChildren->length==1) { # $n has a single child

        $child = $grandChildren->item(0);
        $ct = $child->nodeType;
        $cv = $child->nodeValue;
        if ( $ct==XML_TEXT_NODE && preg_match('/^\s+$/', $cv) ) {
          # $n has a single text child with whitespaces only:
          # replace $n with a single space
          $nn = $node->ownerDocument->createTextNode(" ");
          array_push($oldList, $n);
          array_push($newList, $nn);
          $hasDummies = true;

        } else if ($child->nodeName=="br") {
          # $n has a single "br" child
          array_push($junks, $n);
          $hasDummies = true;

        } else {
          $h = removeDummies($n);
          $hasDummies = $hasDummies || $h;
        }

      } else { # $n has multiple children
        $h = removeDummies($n);
        $hasDummies = $hasDummies || $h;
      }

    } else { # remove empty elements
      array_push($junks, $n);
      $hasDummies = true;
    }

  }
  foreach($junks as $n) $n->parentNode->removeChild($n);
  for ($i=0; $i<count($oldList); $i++)
    $node->replaceChild($newList[$i], $oldList[$i]);

  return $hasDummies;
}
?>

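Notes about my script:

1. The main loop repeats collapseSpaces() and removeDummies() until removeDummies() reports no more dummies, because removing one dummy can turn its parent into a new dummy.

2. After each pass, the document is saved and reloaded, so that text nodes left next to each other by a removal are combined into one.

3. removeDummies() skips text nodes and the <img>, <br>, <td>, and <meta> elements. Any other node without child nodes is removed.

4. Nodes to be removed or replaced are collected in arrays first and changed after the foreach loop, so the child list is not modified while it is being traversed.

5. Unlike collapseSpaces(), removeDummies() does not skip <pre>, so a whitespace-only element inside <pre> is still replaced with a single space.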
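
The save-and-reload step in the main loop exists to merge text nodes that become neighbors after a dummy element is removed. As a possible alternative, the standard DOMNode::normalize() method merges adjacent text nodes in place. The sketch below shows only that merging effect on a small hand-built tree. The element names and text are illustrative, and whether normalize() can fully replace the reload in this script has not been verified here.

```php
<?php
$doc = new DOMDocument();
$p = $doc->createElement("p");
$doc->appendChild($p);

# Build <p>with<span></span>Dummies</p> node by node
$p->appendChild($doc->createTextNode("with"));
$span = $p->appendChild($doc->createElement("span"));
$p->appendChild($doc->createTextNode("Dummies"));

# Removing the empty <span> leaves two separate text nodes
$p->removeChild($span);
$before = $p->childNodes->length;

# normalize() merges adjacent text nodes under $p
$p->normalize();
$after = $p->childNodes->length;

print("Text nodes before normalize(): $before\n");
print("Text nodes after normalize(): $after\n");
print($doc->saveHTML($p) . "\n");
```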

Here is an HTML document, With-Dummies.html, with some dummy elements and redundant whitespace:

<!-- With-Dummies.html
 - Copyright (c) 2009 HerongYang.com. All Rights Reserved.
-->
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title></title>
</head>
<body bgcolor="#ddddff">
  <p>Example<br>with<span> </span>Dummies</p>
  <div>
    <p> </p>
  </div>
  <pre>
    Copyright   c   HerongYang.com
  </pre>
  <p>
    Copyright   c   HerongYang.com
  </p>
</body>
</html>

Here is the output after its dummy elements are removed:

herong> php Remove-Dummy-Elements-in-HTML.php With-Dummies.html

Removing dummies and whitespaces: 1
Removing dummies and whitespaces: 2
Removing dummies and whitespaces: 3

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\
 </head>
<body bgcolor="#ddddff"> <p>Example<br>with Dummies</p> <pre>
    Copyright   c   HerongYang.com
  </pre> <p> Copyright c HerongYang.com </p>
</body>
</html>

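The output tells me that:

1. The script needed 3 passes. The <div> that contained only a whitespace-only <p> could be removed only after the <p> itself was reduced to whitespace.

2. The empty <title></title> element was removed.

3. <span> </span> was replaced with a single space, giving "with Dummies".

4. Whitespace inside <pre> was preserved, while the same text inside <p> was collapsed to single spaces.

5. A DOCTYPE declaration that was not in the input was added to the output.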
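
The whitespace handling seen in this output follows from the pattern used in collapseSpaces(), which matches only runs of two or more whitespace characters. Here is a small sketch that applies the same pattern to the text of the second <p> element in With-Dummies.html. The variable names are illustrative.

```php
<?php
$pattern = '/\s{2,}/';   # same pattern as in collapseSpaces()

# Text content of the second <p> in With-Dummies.html
$text = "\n    Copyright   c   HerongYang.com\n  ";
$collapsed = preg_replace($pattern, " ", $text);
var_dump($collapsed);

# A single whitespace character is not matched, so it stays as it is
$single = preg_replace($pattern, " ", "a\nb");
var_dump($single === "a\nb");
```

The first call returns " Copyright c HerongYang.com ", which matches the content of the last <p> in the output. The second call shows that a lone newline is left unchanged.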

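I think my script is pretty safe to use on HTML documents that contain no input elements. An <input> element has no child nodes, so the script would remove it as an empty element. You need to revise the script if you want to use it on HTML documents with input elements.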
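
One possible revision, sketched under assumptions, is to extend the list of node names that removeDummies() skips, so that elements which are empty by design are never treated as dummies. The helper name isKept() and the extra names added to $keepList are hypothetical choices for illustration. Which names to add depends on the documents being processed.

```php
<?php
# Names skipped by the original script, plus hypothetical additions
$keepList = array("#text", "img", "br", "td", "meta",
                  "input", "textarea", "hr", "link");

# Returns true if the node should never be removed as a dummy
function isKept(DOMNode $n, array $keepList): bool {
  return in_array($n->nodeName, $keepList, true);
}

$doc = new DOMDocument();
$input = $doc->createElement("input");
$span = $doc->createElement("span");
$text = $doc->createTextNode(" ");

var_dump(isKept($input, $keepList));
var_dump(isKept($span, $keepList));
var_dump(isKept($text, $keepList));
```

In removeDummies(), the test on $t that skips special elements could then be replaced with a call like isKept($n, $keepList).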
