25 March 2013

I needed to strip some DOM nodes out of an HTML file. I would have used sed, but some of the tags span multiple lines, and sed and regexes really don't understand HTML/XML; they get confused by nested tags of the same type. In the end I decided to use PHP's built-in DOMDocument class. It is fairly strict and complains loudly (or fails to load the document as you expect) if the HTML isn't well formed, so first I ran the file through PHP's tidy extension. This isn't installed by default, but you can add it with:

sudo apt-get install php5-tidy
So first fix the malformed HTML:
<?php
$html = file_get_contents("myfile.html");
$config = array(
	'indent'         => true,
	'output-xhtml'   => true,
	'wrap'           => 0);
$tidy = tidy_parse_string($html, $config, 'UTF8');
$tidy->cleanRepair();

// And then load the repaired markup into DOMDocument.
// tidy_get_output() returns the cleaned document as a string.

$doc = new DOMDocument();
$doc->loadHTML(tidy_get_output($tidy));
?>
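If you want to see what DOMDocument actually objects to before reaching for tidy, libxml can collect its complaints instead of printing them as PHP warnings. This is only a minimal sketch: the markup in $sample is made up for illustration, and the messages you get (if any) depend on your libxml version.

```php
<?php
// Hypothetical sample markup, invented for this sketch.
$sample = '<div><p>Unclosed paragraph<section>HTML5 tag</section></div>';

// Collect parser errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$probe = new DOMDocument();
$loaded = $probe->loadHTML($sample);

// Fetch whatever the parser recorded, then empty the error buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    echo trim($error->message), " (line ", $error->line, ")\n";
}
?>
```

An empty list here suggests the file may load fine without the tidy step.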
Then it's just a matter of ripping out the tags you don't want. Note how we're iterating through the $nodes variable - it MUST be done this way if you're planning on removing the nodes (as I am), because the list returned by getElementsByTagName is live: as nodes are removed from the document they also disappear from the list. A foreach will do some odd stuff (probably terminate after the first node), and a for-loop counting upwards will skip every other node, because the remaining items shift down by one each time you remove something. Instead, just remove the first item in the list until the list is empty:
<?php
// $nodes is a live list: each removal shrinks it, so always take item(0).
$nodes = $doc->getElementsByTagName("script");
while ($nodes->length > 0) {
    $node = $nodes->item(0);
    remove_node($node);
}

// Objects are passed by handle in PHP, so no & reference is needed.
function remove_node($node) {
    $pnode = $node->parentNode;
    remove_children($node);
    $pnode->removeChild($node);
}

// Recursively empty a node, deepest descendants first.
function remove_children($node) {
    while ($node->firstChild) {
        remove_children($node->firstChild);
        $node->removeChild($node->firstChild);
    }
}
?>
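To see the live-list behaviour for yourself, here is a small self-contained sketch. The markup in $sample is made up for illustration. The first half runs the naive upward for-loop; the second half shows one alternative to the while loop, which is to copy the list into a plain array with iterator_to_array() before removing anything. That is a different technique from the one above, not a replacement for it.

```php
<?php
// Hypothetical sample markup with three script tags, invented for this sketch.
$sample = '<html><body><script>a()</script><p>keep</p>'
        . '<script>b()</script><script>c()</script></body></html>';

// Pitfall: counting upwards over the live list.
$doc1 = new DOMDocument();
$doc1->loadHTML($sample);
$live = $doc1->getElementsByTagName('script');
for ($i = 0; $i < $live->length; $i++) {
    // Each removal shifts the remaining items down by one,
    // so the next index skips over a node.
    $victim = $live->item($i);
    $victim->parentNode->removeChild($victim);
}
$leftBehind = $doc1->getElementsByTagName('script')->length;

// Alternative: snapshot the list into an array, then iterate safely.
$doc2 = new DOMDocument();
$doc2->loadHTML($sample);
$snapshot = iterator_to_array($doc2->getElementsByTagName('script'));
foreach ($snapshot as $script) {
    $script->parentNode->removeChild($script);
}
$remaining = $doc2->getElementsByTagName('script')->length;

echo "for-loop left behind: ", $leftBehind, "\n";
echo "snapshot left behind: ", $remaining, "\n";
?>
```

The array holds ordinary references to the nodes, so removing them from the document no longer changes what the foreach is walking over.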

