Load HTML Documents with LIBXML_NOBLANKS

PHP Modules Tutorials - Herong's Tutorial Examples

This section provides a tutorial example on how to load HTML documents with the LIBXML_NOBLANKS option using the PHP DOM Extension. In the test below, the option removes blank text nodes directly under the 'html' element and inside the 'head' element, but not inside the 'body' element.

In HTML documents, whitespace (any sequence of space, tab, carriage return and line feed characters) is largely ignored. To be precise, whitespace between words is treated as a single space, and whitespace before and after element tags is usually ignored.
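The collapsing rule described above can be approximated in a few lines of PHP. This is only a sketch: collapseWhitespace() is a hypothetical helper name, and it ignores exceptions such as the 'pre' element, where whitespace is preserved.

```php
<?php
// Hypothetical helper: approximate how whitespace in normal HTML text
// is collapsed. Runs of space, tab, CR and LF become a single space,
// and leading/trailing whitespace is dropped.
function collapseWhitespace($text) {
  return trim(preg_replace('/[ \t\r\n]+/', ' ', $text));
}

# Prints: Hello World!
print(collapseWhitespace("\n      Hello   World!\n    ") . "\n");
```

Applied to the content of a formatted 'p' element, the helper returns the text roughly as a browser would display it.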

Whitespace is usually inserted into an HTML document to make it more readable. Here is an example of an HTML document formatted with whitespace, Hello-Formatted.html:

<html>
  <head>
    <title>
      Hello
    </title>
  </head>
  <body bgcolor="#ddddff">
    <p>
      Hello World!
    </p>
  </body>
</html>

To save storage space, you may want to remove whitespace, also called blanks, from a formatted HTML document.

According to the DOM Extension documentation, the LIBXML_NOBLANKS option can be passed to the loadHTML() method to remove blank nodes.

Here is my test script, Load-HTML-with-NOBLANKS.php:

<?php
#  Load-HTML-with-NOBLANKS.php
#- Copyright 2009 (c) HerongYang.com. All Rights Reserved.

  # Command-line arguments: input file and optional load mode
  $input = $argv[1];
  $option = isset($argv[2]) ? $argv[2] : "default";
  $html = file_get_contents($input);
  $doc = new DOMDocument();

  if ($option=="noblanks") {
    $doc->loadHTML($html, LIBXML_NOBLANKS);
  } else {
    $doc->loadHTML($html);
  }

  print("\n------ HTML Document ------\n");
  print($doc->saveHTML());

  print("\n------ DOMNode Tree ------\n");
  printNode($doc, "");

function printNode($node, $prefix) {
  $attriLen = -1;
  $attriList = $node->attributes;
  if (isset($attriList)) $attriLen = $attriList->length;

  $childLen = -1;
  $childList = $node->childNodes;
  if (isset($childList)) $childLen = $childList->length;

  print($prefix . $node->nodeName . "= "
    . $attriLen . ", " . $childLen . ", " . $node->nodeType . " ");
  if ($node->nodeType==XML_TEXT_NODE)
    print("text: (" . $node->nodeValue . ")\n");
  else if ($node->nodeType==XML_ATTRIBUTE_NODE)
    print("attribute: (" . $node->nodeValue . ")\n");
  else
    print("other: (...)\n");

  if (isset($attriList)) foreach ($attriList as $n)
    printNode($n, $prefix." @");
  if (isset($childList)) foreach ($childList as $n)
    printNode($n, $prefix." ");
}
?>

Test 1 - Run the script without LIBXML_NOBLANKS to see all blank nodes:

herong> php Load-HTML-with-NOBLANKS.php Hello-Formatted.html default

------ HTML Document ------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html>
  <head>
    <title>
      Hello
    </title>
  </head>
  <body bgcolor="#ddddff">
    <p>
      Hello World!
    </p>
  </body>
</html>

------ DOMNode Tree ------
#document= -1, 2, 13 other: (...)
 html= -1, -1, 10 other: (...)
 html= 0, 5, 1 other: (...)
  #text= -1, -1, 3 text: (
  )
  head= 0, 3, 1 other: (...)
   #text= -1, -1, 3 text: (
    )
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello
    )
   #text= -1, -1, 3 text: (
  )
  #text= -1, -1, 3 text: (
  )
  body= 1, 3, 1 other: (...)
   @bgcolor= -1, 1, 2 attribute: (#ddddff)
   @ #text= -1, -1, 3 text: (#ddddff)
   #text= -1, -1, 3 text: (
    )
   p= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello World!
    )
   #text= -1, -1, 3 text: (
  )
  #text= -1, -1, 3 text: (
)

As you can see from the output, the default behavior of loadHTML() preserves all whitespace between tags as "#text" nodes.
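Because each blank node prints its raw line break, the tree dump above is somewhat hard to read. One possible way to make blank nodes visible, shown here only as a sketch, is to print text values with escape sequences. visibleText() is a hypothetical helper name, and the small inline document is an assumption used to keep the example self-contained.

```php
<?php
// Hypothetical helper: return a text value with control characters
// written as escape sequences, so a blank node fits on one line.
function visibleText($text) {
  // json_encode() turns a line feed into the two characters \n
  return json_encode($text);
}

$html = "<html>\n  <body>\n    <p>Hello</p>\n  </body>\n</html>";
$doc = new DOMDocument();
$doc->loadHTML($html);

// List every text node in document order with its parent element name
$xpath = new DOMXPath($doc);
foreach ($xpath->query("//text()") as $node) {
  print($node->parentNode->nodeName . " > #text: "
    . visibleText($node->nodeValue) . "\n");
}
```

With this helper, a blank node shows up as a quoted string such as "\n  " instead of an open parenthesis followed by a line break.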

Test 2 - Run the script with LIBXML_NOBLANKS to see which blank nodes are removed:

herong> php Load-HTML-with-NOBLANKS.php Hello-Formatted.html noblanks

------ HTML Document ------
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><head><title>
      Hello
    </title></head><body bgcolor="#ddddff">
    <p>
      Hello World!
    </p>
  </body></html>

------ DOMNode Tree ------
#document= -1, 2, 13 other: (...)
 html= -1, -1, 10 other: (...)
 html= 0, 2, 1 other: (...)
  head= 0, 1, 1 other: (...)
   title= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello
    )
  body= 1, 3, 1 other: (...)
   @bgcolor= -1, 1, 2 attribute: (#ddddff)
   @ #text= -1, -1, 3 text: (#ddddff)
   #text= -1, -1, 3 text: (
    )
   p= 0, 1, 1 other: (...)
    #text= -1, -1, 3 text: (
      Hello World!
    )
   #text= -1, -1, 3 text: (
  )

The output shows some interesting surprises:

The LIBXML_NOBLANKS option removes blank nodes between tags inside <head>. For example, the spaces and line breaks between <head> and <title> are removed as a blank node. The blank nodes directly under <html> are removed too: its child count drops from 5 to 2.
The LIBXML_NOBLANKS option does not remove any blank nodes between tags inside <body>. For example, the spaces and line breaks between <body> and <p> are kept as a blank node.
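The two observations above can be checked without reading the whole tree dump by counting whitespace-only text nodes with XPath. The following is a minimal sketch: countBlankTextNodes() is a hypothetical helper name, the inline document is an assumption, and the exact counts may depend on the libxml2 version behind your PHP installation.

```php
<?php
// Hypothetical helper: count text nodes made up of whitespace only.
function countBlankTextNodes(DOMDocument $doc) {
  $xpath = new DOMXPath($doc);
  // normalize-space() yields an empty string for whitespace-only text
  return $xpath->query("//text()[normalize-space(.) = '']")->length;
}

$html = "<html>\n  <head>\n    <title>Hello</title>\n  </head>\n"
      . "  <body>\n    <p>Hello</p>\n    <p>World!</p>\n  </body>\n</html>";

// Load the same document twice: once by default, once with the option
$default = new DOMDocument();
$default->loadHTML($html);

$noblanks = new DOMDocument();
$noblanks->loadHTML($html, LIBXML_NOBLANKS);

print("Blank text nodes, default:  " . countBlankTextNodes($default) . "\n");
print("Blank text nodes, noblanks: " . countBlankTextNodes($noblanks) . "\n");
```

If the second count is lower than the first but still above zero, the option has removed only part of the blank nodes, which matches the behavior seen in Test 2.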

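I think that the LIBXML_NOBLANKS option is trying to preserve whitespace inside <body>, because it can be meaningful in HTML documents. For example, the space between two inline elements, as in <b>Hello</b> <i>World</i>, is rendered as a space between the two words. This is different from XML documents.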
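If the blank nodes left inside <body> are really not wanted, one possible workaround is to remove them by hand after loading. The sketch below assumes that no whitespace-only node in the document matters for rendering, which is not true in general (see the inline element example above, or text inside 'pre'). removeBlankTextNodes() is a hypothetical helper name, not part of the DOM Extension.

```php
<?php
// Hypothetical helper, not part of the DOM extension: delete every
// whitespace-only text node that is still in the tree after loading.
function removeBlankTextNodes(DOMDocument $doc) {
  $xpath = new DOMXPath($doc);
  $removed = 0;
  foreach ($xpath->query("//text()[normalize-space(.) = '']") as $node) {
    $node->parentNode->removeChild($node);
    $removed++;
  }
  return $removed;
}

$html = "<html>\n  <head>\n    <title>Hello</title>\n  </head>\n"
      . "  <body>\n    <p>Hello World!</p>\n  </body>\n</html>";

$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_NOBLANKS);
$removed = removeBlankTextNodes($doc);

print("Removed " . $removed . " blank text nodes\n");
print($doc->saveHTML());
```

The helper only removes text nodes that are entirely whitespace. Leading and trailing whitespace inside a text node such as "Hello World!" is left untouched.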