matt-helps

insight on all things techie

Code Snippet: Using PHP DOMDocument to remove nodes

I needed to strip some DOM nodes out of an HTML file. I would have used sed, but some of the tags span multiple lines, and sed and regexes don’t really understand HTML/XML: they get confused by nested tags of the same type. In the end I decided to use PHP’s built-in DOMDocument class. It is fairly strict and refuses to load if the HTML isn’t well formed, so first I ran the file through PHP’s tidy extension. This isn’t installed by default, but you can add it with:

sudo apt-get install php5-tidy

So first fix the malformed HTML:

$html = file_get_contents("myfile.html");
$config = array(
    'indent' => true,
    'output-xhtml' => true,
    'wrap' => 0);
$tidy = tidy_parse_string($html, $config, 'UTF8');
$tidy->cleanRepair();

And then load it into DOMDocument:

$doc = new DOMDocument();
// Pass the repaired markup as a string rather than the tidy object itself.
$doc->loadHTML(tidy_get_output($tidy));

Then it’s just a matter of ripping out the tags you don’t want. Note how we’re iterating through the $nodes variable. It MUST be done this way if you’re planning on removing the nodes (as I am), because the collection is live: as nodes are removed from the document they also disappear from the collection. A foreach will do some odd stuff (probably terminate after the first node), and an index-based for-loop will have you missing every other node. Instead, keep removing the first item until the collection is empty:

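// getElementsByTagName() returns a live list: removed nodes drop out of it.
$nodes = $doc->getElementsByTagName("script");
while ($nodes->length > 0) {
    // Always take item 0; the list shrinks after every removal.
    remove_node($nodes->item(0));
}
// Detach a node, and everything beneath it, from its parent.
// Objects are passed by handle, so no & reference is needed.
function remove_node($node) {
    $pnode = $node->parentNode;
    remove_children($node);
    $pnode->removeChild($node);
}
// Recursively strip all descendants, deepest first.
function remove_children($node) {
    while ($node->firstChild) {
        remove_children($node->firstChild);
        $node->removeChild($node->firstChild);
    }
}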

Ubuntu recovery mode: mounting read-only filesystem

I recently made some changes to my /etc/X11/xorg.conf file which backfired: Ubuntu (12.10 in this case) would crash while booting.

It should be simple enough to undo the last change you made to a system file when booting into recovery mode, but alas it is not straightforward because when booting into recovery mode the filesystem is mounted as read-only! Oh dear!

1. Boot into “Ubuntu (Recovery Mode)”
2. From the recovery menu drop to a root shell.
3. Enter the following command to remount your filesystem as read-write:

mount -o remount,rw /

4. Make the required changes to your operating system and then exit/reboot/whatever.
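If you want to confirm that the root filesystem really is read-only before remounting, something like the following should do it. It is only a rough sketch: the function names is_readonly and root_mount_options are made up for this example, and it assumes /proc/mounts is available in the recovery shell.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ubuntu): succeeds when a comma-separated
# mount option string contains the standalone "ro" flag. Options such as
# "errors=remount-ro" do not count.
is_readonly() {
    case ",$1," in
        *,ro,*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Hypothetical helper: read a /proc/mounts-style table on stdin and print the
# options of the last entry mounted on "/". Field 2 is the mount point and
# field 4 the option list; a later entry for the same mount point shadows
# earlier ones.
root_mount_options() {
    awk '$2 == "/" { opts = $4 } END { print opts }'
}

# Example usage from the recovery root shell (commented out because the
# remount needs root and changes system state):
# opts=$(root_mount_options < /proc/mounts)
# if is_readonly "$opts"; then mount -o remount,rw /; fi
```

The check is optional: running the remount on a filesystem that is already read-write should be harmless.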

Hope that helps someone out there!

apache2ctl status no permission for /server-status

Today I was having a real problem. Every minute on my Linux box I call

apache2ctl status

via a cronjob and this gives me various stats on how many servers are used/free. Useful. But upon moving some old websites off the server the above call began to fail – I’d get the error message

Forbidden
You don't have permission to access /server-status on this server.

There are lots of solutions around the web (some about SELinux, some about setting allow/deny in an Apache config file under the /server-status location), none of which actually solved my problem. They did give me enough hints to work out what was going on, though. When you call apache2ctl status, it requests /server-status (served by mod_status) from the default website on your webserver. If you only have one site then it’s easy to work out which one that is, but if you’re hosting multiple sites it’s not quite as obvious. For me it was the first site in the /etc/apache2/sites-enabled directory. You can also find it by visiting the webserver via its IP address rather than through a domain name: the domain name is what directs a request to a specific site that you host, so without one the server doesn’t really know what to serve you and sends you to the default site.

So my default site was refusing to serve /server-status. I couldn’t think why this was until I visited my .htaccess file and realised that I was using mod_rewrite to rewrite any www.mysite.com calls to mysite.com like so:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^mysite\.com$
RewriteRule (.*) http://mysite.com/$1 [R=301,L]

The answer then was to pop in an exclusion for /server-status so that it would be unaffected by mod_rewrite:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^mysite\.com$
RewriteCond %{REQUEST_URI} !^/server-status
RewriteRule (.*) http://mysite.com/$1 [R=301,L]

And hey presto it works! Hurrah!

UPDATE: I’ve found a better version. I host the site on a virtual server (Amazon’s EC2 stuff), which means I often need to test a new instance of the server using either the IP address or a funny Amazon made-up name like ec2-99-98-97-96.compute-7.amazonaws.com. If I stick that in the address bar of a browser, the .htaccess reroutes it to mysite.com (the one that is live). So really I just need to slice the www off any HTTP_HOST that has one, and keep an exception in there for /server-status so that apache2ctl status doesn’t complain. Here’s what it looks like now:

RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteCond %{REQUEST_URI} !^/server-status
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]

If you wanted to be ultra portable you could put an if around it:

<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
RewriteCond %{REQUEST_URI} !^/server-status
RewriteRule ^(.*)$ http://%1/$1 [R=301,L]
</IfModule>
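To sanity-check which requests those rules will redirect, here is a rough shell model of the two conditions. It is a sketch only: redirect_target is a made-up name, the function does not talk to Apache at all, and it assumes the rules live in a .htaccess at the document root (where $1 is the path without its leading slash).

```shell
#!/bin/sh
# Hypothetical model of the rewrite rules above. Prints the redirect target,
# or nothing when the request would be served without a redirect.
redirect_target() {
    host=$1
    uri=$2
    # RewriteCond %{REQUEST_URI} !^/server-status
    # Anything under /server-status is left alone.
    case "$uri" in
        /server-status*) return 0 ;;
    esac
    # RewriteCond %{HTTP_HOST} ^www\.(.*)$ [NC]
    # The host must begin with "www." in any mix of upper and lower case.
    case "$host" in
        [Ww][Ww][Ww].*)
            # %1 is the host minus its 4-character "www." prefix.
            printf 'http://%s/%s\n' "${host#????}" "${uri#/}"
            ;;
    esac
}

redirect_target www.mysite.com /about.html
redirect_target www.mysite.com /server-status
redirect_target ec2-99-98-97-96.compute-7.amazonaws.com /about.html
```

Only the first call prints anything (http://mysite.com/about.html). The /server-status request and the EC2 hostname both fall through untouched, which is the behaviour the update was after.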

dnscurl.pl name lookup timed out (Amazon Route53)

Had some problems this morning migrating the routing for one of my domains over to Route53, and Google can’t find anyone talking about this issue, so (as is my policy) I thought I had better make the information available for other people facing the same thing. I was attempting to use dnscurl.pl to create a hosted zone, but it would come back with the error:

curl: (6) name lookup timed out
Ouch, curl --progress-bar -I --max-time 5 --url https://route53.amazonaws.com/date --insecure failed with exit status 6

I couldn’t work out whether it meant that some sort of insecurity had made it fail (please use quotes to show where the command ends in future!). So I ran the command from the error message (from curl onwards) on its own to see what would happen:

curl --progress-bar -I --max-time 5 --url https://route53.amazonaws.com/date --insecure

…and sure enough, back came the error:

curl: (6) name lookup timed out

OK, so it’s a timeout or the site isn’t up at all, not a security-related issue. I then browsed to https://route53.amazonaws.com/2011-05-05 (correct at the time of publishing) and found that it took more than 5 seconds, but that it did eventually connect. It came back with:

<UnknownOperationException/>

So the site was up and working, but it was taking quite a while to respond. The curl command above only gives it 5 seconds, which isn’t long in the world of requests really, so I raised the timeout from 5 seconds to 15 and it worked fine (HTTP code 200):

curl --progress-bar -I --max-time 15 --url https://route53.amazonaws.com/date --insecure

So somewhere in dnscurl.pl there’s a line similar to the one above that only allows 5 seconds for a response. That is not long enough, so we need to change it to something more sensible; perhaps 20 or 30 seconds will do. I’ve never played with Perl, but how different can it be, right? I opened dnscurl.pl in an editor and found what looked like the right reference to $CURL on line 190 (your results may vary), which looks like this:

my $curl_output_lines = run_cmd_read($CURL, "--progress-bar", "-I", "--max-time", "5", "--url", $url, "--insecure");

Simply changing the number 5 to a 30 seemed to solve the problem because the next time I called dnscurl.pl it all worked splendiferously.
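If you would rather not edit the file by hand, the same change could be scripted with sed. This is just a sketch: raise_max_time is a made-up helper, and it assumes your copy of dnscurl.pl contains the "--max-time", "5" pair with exactly the spacing shown above.

```shell
#!/bin/sh
# Hypothetical helper: print a copy of the given script with the 5-second
# --max-time argument replaced by a new value (30 seconds by default).
# The original file is not modified.
raise_max_time() {
    script=$1
    seconds=${2:-30}
    sed "s/\"--max-time\", \"5\"/\"--max-time\", \"$seconds\"/" "$script"
}

# Example usage (commented out; assumes dnscurl.pl is in the current directory):
# cp dnscurl.pl dnscurl.pl.orig
# raise_max_time dnscurl.pl.orig 30 > dnscurl.pl
```

Writing to a new file and keeping the .orig copy makes it easy to diff the two and check that only the timeout line changed.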

monodroid: System.NullReferenceException in aresgen.exe

Just in case anyone else gets this uninformative error on their monodroid packaging, I got mine when I included an ImageView in my layout and this highlighted two problems.

1. I got the useless message because I hadn’t created an AndroidManifest.xml file. New projects don’t have one by default, so right-click on the project, go to Properties, then Android Manifest, and create one. Fill in some details. The file needs to exist anyway, but you can get away with not having one until you put an ImageView in there.

2. Then when you rebuild the project you’ll get a sensible message out of aresgen.exe, which for me was an incorrectly spelt resource name. My .png file was in the correct directory (Resources/Drawable) but I’d misspelled it by one letter.

Hope this saves someone else 30 minutes of searching around for answers.


Follow mattparkins on Twitter