I needed to strip some DOM nodes out of an HTML file. I would have used sed, but some of the tags span multiple lines, and sed and regexes don't really understand HTML/XML: they get confused by nested tags of the same type. In the end I decided to use PHP's built-in DOMDocument functions. DOMDocument is fairly strict and can refuse to load HTML that isn't well formed, so first I ran the file through PHP's tidy extension. This isn't installed by default, so you will need to add it separately.
So first fix the malformed HTML:
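A minimal sketch of the repair step, assuming the tidy extension is loaded. The sample markup and the option values are made up for illustration; `tidy_repair_string()` is a standard tidy function, but the right options depend on your input.

```php
<?php
// Bail out early if the tidy extension is not available.
if (!extension_loaded('tidy')) {
    exit("tidy extension is not loaded\n");
}

// Hypothetical input: unclosed <p> and <b> tags.
$html = '<div><p>First<p>Second <b>bold</div>';

// Assumed options for this sketch; adjust for your own files.
$config = [
    'output-xhtml' => true, // emit well-formed XHTML
    'wrap'         => 0,    // do not wrap long lines
];

// Parse and repair the markup in one call; returns the cleaned document.
$clean = tidy_repair_string($html, $config, 'utf8');

echo $clean;
```

For a file on disk, `tidy_repair_file()` takes a filename in place of the string.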
Then it's just a matter of ripping out the tags you don't want. Note how we iterate through the $nodes variable - it MUST be done this way if you're planning on removing the nodes (as I am), because the collection is live: as nodes are removed from the document they also disappear from the collection. A foreach will do odd things (probably terminating after the first node), and an index-based for loop will skip every other node. Instead, just keep removing the first node until there are none left.
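A minimal sketch of the removal loop described above. The sample document and the choice of `script` as the tag to remove are assumptions for illustration; substitute whichever tag you need to strip.

```php
<?php
// Hypothetical, already well-formed input.
$html = '<html><body><div id="keep">Keep</div>'
      . '<script>one()</script><p>Text <script>two()</script></p>'
      . '<script>three()</script></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// getElementsByTagName() returns a live DOMNodeList.
$nodes = $doc->getElementsByTagName('script');

// Each removal shrinks the list, so always take item(0)
// rather than advancing an index.
while ($nodes->length > 0) {
    $node = $nodes->item(0);
    $node->parentNode->removeChild($node);
}

$result = $doc->saveHTML();
echo $result;
```

Because the loop condition re-reads `$nodes->length` on every pass, it ends only once every matching node has been removed, including nested ones.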