Load HTML Documents with LIBXML_NOBLANKS

This section provides a tutorial example on how to load HTML documents with the LIBXML_NOBLANKS option using the PHP DOM Extension. In the test below, it removes whitespace-only text nodes directly under the 'html' and 'head' elements, but not inside the 'body' element.

In HTML documents, whitespace (any sequence of space, tab, carriage return and line feed characters) is largely ignored. To be precise, whitespace between words is treated as a single space, and whitespace before and after element tags is ignored in most cases.
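As a rough illustration of the "single space" rule, here is a minimal sketch that collapses whitespace the way a browser would when rendering text between words. The collapseWhitespace() function is a hypothetical helper written for this example; it is not part of PHP or the DOM Extension, and it only approximates browser behavior.

```php
<?php
// Illustration only: collapseWhitespace() is a hypothetical helper that
// approximates how runs of whitespace between words are rendered.
function collapseWhitespace(string $text): string {
  // Replace each run of space, tab, CR and LF with one space,
  // then drop the leading and trailing space.
  return trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
}

$collapsed = collapseWhitespace("\n      Hello \t World!\n    ");
print($collapsed . "\n");   // prints "Hello World!"
```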

Whitespace is usually inserted into an HTML document to make it more readable. Here is an example of an HTML document formatted with whitespace, Hello-Formatted.html:

<html>
  <head>
    <title>
      Hello
    </title>
  </head>
  <body bgcolor="#ddddff">
    <p>
      Hello World!
    </p>
  </body>
</html>

To save storage space, you may want to remove whitespace, also called blanks, from a formatted HTML document.

According to the DOM Extension documentation, the LIBXML_NOBLANKS option on the loadHTML() method can be used to remove blank nodes.

Here is my test script, Load-HTML-with-NOBLANKS.php:

<?php
#  Load-HTML-with-NOBLANKS.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

  $input = $argv[1];
  $option = $argv[2];
  $html = file_get_contents($input);
  $doc = new DOMDocument();

  if ($option=="noblanks") {
    $doc->loadHTML($html, LIBXML_NOBLANKS);
  } else {
    $doc->loadHTML($html);
  }

  print("\n------ HTML Document ------\n");
  print($doc->saveHTML());

  print("\n------ DOMNode Tree ------\n");
  printNode($doc, "");

function printNode($node, $prefix) {
  $attriLen = -1;
  $attriList = $node->attributes;
  if (isset($attriList)) $attriLen = $attriList->length;

  $childLen = -1;
  $childList = $node->childNodes;
  if (isset($childList)) $childLen = $childList->length;

  print($prefix . $node->nodeName . "= "
    . $attriLen . ", " . $childLen . ", " . $node->nodeType . " ");
  if ($node->nodeType==XML_TEXT_NODE)
    print("text: (" . $node->nodeValue . ")\n");
  else if ($node->nodeType==XML_ATTRIBUTE_NODE)
    print("attribute: (" . $node->nodeValue . ")\n");
  else
    print("other: (...)\n");

  if (isset($attriList)) foreach ($attriList as $n)
    printNode($n, $prefix." @");
  if (isset($childList)) foreach ($childList as $n)
    printNode($n, $prefix." ");
}
?>

Test 1 - Run the script without LIBXML_NOBLANKS to see all blank nodes:

herong> php Load-HTML-with-NOBLANKS.php Hello-Formatted.html default

------ HTML Document ------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html>
  <head>
    <title>
      Hello
    </title>
  </head>
  <body bgcolor="#ddddff">
    <p>
      Hello World!
    </p>
  </body>
</html>

------ DOMNode Tree ------
#document= -1, 2, 13 other: (...)
 html= -1, -1, 10 other: (...)
 html= 0, 5, 1 other: (...)
  #text= -1, -1, 3 text: (
  )
  head= 0, 3, 1 other: (...)
   #text= -1, -1, 3 text: (
    )
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello
    )
   #text= -1, -1, 3 text: (
  )
  #text= -1, -1, 3 text: (
  )
  body= 1, 3, 1 other: (...)
   @bgcolor= -1, 1, 2 attribute: (#ddddff)
   @ #text= -1, -1, 3 text: (#ddddff)
   #text= -1, -1, 3 text: (
    )
   p= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello World!
    )
   #text= -1, -1, 3 text: (
  )
  #text= -1, -1, 3 text: (
)

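As you can see from the output, the default behavior of loadHTML() preserves all whitespace as "#text" nodes. Each line of the tree shows the node name, the number of attributes, the number of child nodes, and the node type; a count of -1 means the node has no such list.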
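If you only want to know how many blank nodes a loaded document contains, reading the whole tree dump is tedious. Here is a minimal sketch that counts whitespace-only text nodes with an XPath query. The countBlankTextNodes() function is a hypothetical helper written for this example, and the inline HTML string is a shortened stand-in for Hello-Formatted.html. The exact count may vary with the libxml2 version behind your PHP build, so no expected number is given.

```php
<?php
// countBlankTextNodes() is a hypothetical helper, not part of the DOM Extension.
function countBlankTextNodes(DOMDocument $doc): int {
  $xpath = new DOMXPath($doc);
  // normalize-space() returns an empty string for whitespace-only text,
  // so this query selects only the blank "#text" nodes.
  $blanks = $xpath->query("//text()[normalize-space(.) = '']");
  return $blanks->length;
}

// Shortened stand-in for Hello-Formatted.html (assumption for this sketch).
$html = "<html>\n  <head>\n    <title>Hello</title>\n  </head>\n"
  . "  <body>\n    <p>Hello World!</p>\n  </body>\n</html>\n";

$doc = new DOMDocument();
$doc->loadHTML($html);   // default options: blank nodes are kept

$blankCount = countBlankTextNodes($doc);
print("Blank text nodes: " . $blankCount . "\n");
```

The same helper can be called on a document loaded with other options to compare the counts.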

Test 2 - Run the script with LIBXML_NOBLANKS to see which blank nodes are removed:

herong> php Load-HTML-with-NOBLANKS.php Hello-Formatted.html noblanks

------ HTML Document ------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><head><title>
      Hello
    </title></head><body bgcolor="#ddddff">
    <p>
      Hello World!
    </p>
  </body></html>

------ DOMNode Tree ------
#document= -1, 2, 13 other: (...)
 html= -1, -1, 10 other: (...)
 html= 0, 2, 1 other: (...)
  head= 0, 1, 1 other: (...)
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello
    )
  body= 1, 3, 1 other: (...)
   @bgcolor= -1, 1, 2 attribute: (#ddddff)
   @ #text= -1, -1, 3 text: (#ddddff)
   #text= -1, -1, 3 text: (
    )
   p= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello World!
    )
   #text= -1, -1, 3 text: (
  )

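The output shows some interesting surprises to me: the blank "#text" nodes directly under <html> and <head> are removed, but the blank "#text" nodes inside <body> are preserved. Whitespace inside the text of <title> and <p> is not touched at all.

I think that the LIBXML_NOBLANKS option is trying to preserve whitespace inside <body>, because it can have meaning in HTML documents. This is different from XML documents.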
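To see the difference from XML documents, here is a minimal sketch that loads a small XML string with loadXML(), once with default options and once with LIBXML_NOBLANKS, and compares the number of child nodes under the root element. The XML string and variable names are made up for this example. With the default load I expect 5 children (3 blank text nodes and 2 elements); with LIBXML_NOBLANKS the blank text nodes should be dropped, though I have not verified this on every libxml2 version.

```php
<?php
// Sample XML made up for this comparison (assumption, not from the tests above).
$xml = "<root>\n  <a>x</a>\n  <b>y</b>\n</root>\n";

// Default load: whitespace between elements is kept as "#text" nodes.
$default = new DOMDocument();
$default->loadXML($xml);

// LIBXML_NOBLANKS: the XML parser is asked to drop blank text nodes.
$noBlanks = new DOMDocument();
$noBlanks->loadXML($xml, LIBXML_NOBLANKS);

$defaultCount = $default->documentElement->childNodes->length;
$noBlanksCount = $noBlanks->documentElement->childNodes->length;
print("default: $defaultCount, noblanks: $noBlanksCount\n");
```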
