Working with DOM in PHP – Looking at a PHP HTML Parser
June 6, 2010 – 12:25 pm
So, let’s assume you’ve got a PHP project where you’re scraping pages and trying to parse fields out of the DOM. Up till now, I’ve just used regular expressions because they’re easy. I avoided trying to parse HTML as XML using SimpleXML because there are just too many cases where it would fail due to invalid tags.
Well, I feel like an idiot. It turns out there’s a great extension built into PHP to do just that, and it’s the DOM extension. Using this, parsing HTML with PHP is just as easy as accessing the DOM using jQuery (hint: very easy).
Let’s say we’ve already got a page on our drive. For this example, I’ll use the homepage of this blog. We’re going to parse out all the links. I’ve saved the page as index.html, and in the same directory I’ve created the parser script.
<?php
$dom = new DOMDocument;
// must be set before loading to have any effect
$dom->preserveWhiteSpace = false;
// you can use loadHTML if you already have your string in memory
$dom->loadHTMLFile( "index.html" );

// grab all the A tags
// returns a DOMNodeList
$tags = $dom->getElementsByTagName( 'a' );

// you can actually iterate over the tags returned -
// I'm not sure why they don't say that more explicitly
// length is already the number of nodes, so no count() is needed
echo "Total length: " . $tags->length . "\n";
foreach ($tags as $t)
{
    // each of these is a DOMElement object
    // the value is what's inside the tag
    // the attributes can also be accessed
    printf( "%-50s%s \n", $t->nodeValue, $t->getAttribute('href') );
}
Here’s a glimpse of the output:
vim            http://www.rustyrazorblade.com/category/vim/
virtual box    http://www.rustyrazorblade.com/category/virtual-box/
vmware         http://www.rustyrazorblade.com/category/vmware/
weird          http://www.rustyrazorblade.com/category/weird/
wikipedia      http://www.rustyrazorblade.com/category/wikipedia/
windows        http://www.rustyrazorblade.com/category/windows/
xcode          http://www.rustyrazorblade.com/category/xcode/
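If the page is already in memory as a string, the same approach should work with loadHTML instead of loadHTMLFile. Here is a minimal sketch under that assumption. The markup is a made-up stand-in for a scraped page (with deliberately unclosed tags), and the call to libxml_use_internal_errors is my addition to keep parser warnings about bad markup out of the output.

```php
<?php
// Hypothetical markup standing in for a scraped page; note the unclosed tags.
$html = '<html><body><p>Unclosed paragraph
<a href="/category/vim/">vim</a>
<a href="/category/xcode/">xcode
</body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// Discard the collected warnings once parsing is done.
libxml_clear_errors();

// Map link text => href for every A tag in the document.
$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $links[trim($a->nodeValue)] = $a->getAttribute('href');
}

print_r($links);
```

The trim() call is there because the text inside a tag often carries stray newlines, especially when the tag was never closed.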
Here’s another great reference I originally used to get started:
You can take this a bit further if you want to use the PHP cURL extension to fetch pages instead of saving them to disk first. Additionally, if you’re interested in using the advanced curl_multi_exec functionality, check out my previous post.
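As a rough sketch of how the two pieces could fit together, assuming the curl extension is installed: one function fetches the page, another pulls the links out of whatever string it is given. The function names fetchHtml and extractLinks are hypothetical helpers of mine, not part of either extension, and the timeout value is arbitrary.

```php
<?php
// Hypothetical helper: fetch a page with cURL.
// Returns the response body as a string, or null if the request failed.
function fetchHtml(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // arbitrary 10 second limit
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Hypothetical helper: return [text, href] pairs for every A tag in $html.
function extractLinks(string $html): array
{
    if ($html === '') {
        return []; // loadHTML rejects an empty string
    }
    libxml_use_internal_errors(true); // keep warnings about invalid markup quiet
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[] = [trim($a->nodeValue), $a->getAttribute('href')];
    }
    return $links;
}

// Usage (needs network access):
// $html = fetchHtml('http://www.rustyrazorblade.com/');
// if ($html !== null) {
//     foreach (extractLinks($html) as [$text, $href]) {
//         printf("%-50s%s \n", $text, $href);
//     }
// }
```

Keeping the fetch and the parse separate means the parsing half can be tried against a saved file or a literal string without touching the network.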
Edit: cynope on Reddit suggested phpQuery. I haven’t used it yet but it looks pretty cool. If I get a chance to try it I’ll post a follow-up.



5 Responses to “Working with DOM in PHP – Looking at a PHP HTML Parser”
Yes, there are “easier” ways to parse HTML in php. However, I’ve found that I almost always end up using regex. Maybe it’s just me, the type of projects or the source, but DOM traversing was never good enough.
By Ming on Jun 9, 2010
The nice part of the DOM parser is that it can handle invalid HTML, as well as correctly parse out the attributes from a tag. I wrote that script in about 5 minutes vs the trial and error of using a bunch of regexes.
By jon on Jun 9, 2010
Usually, I need specific pieces of data from within each DOM element or attribute, so I would end up with regex anyways. In those cases I just go straight to regex and dump any additional extensions and parsers. Also, maybe it’s my background in Perl, but regex has never been a long painstaking task.
Don’t get me wrong. For things like grabbing all links or images from an HTML source, using what you suggest is a great way to do it and very easy.
By Ming on Jun 9, 2010
Ming – I’m not sure I follow though. If you check the example, you can easily grab any of the attributes or values. (Note the href pulled out using getAttribute('href').)
You can also do things like grab elements and their contents by ID (getElementById), then pull what you want out of that using the same techniques listed above.
I’d like to see an example where the regex is easier than the DOM parser.
By jon on Jun 9, 2010
Jon,
Yes, if you are pulling the entire element and contents then you are right. Or even if you are going to do some basic split or string match, etc.
I’m speaking more about the case where you would grab the contents of an element and then regex a specific value out of it. e.g. grab the text within a div and then extract a specific value. In that case where you would end up using regex anyways to get that value, I would probably just skip DOM and go straight to regex of the entire doc.
Keep in mind that this does not address performance or elegance, just my personal preference and style.
By Ming on Jun 10, 2010
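For anyone following the thread, here is a small sketch of the case discussed above: grab one element by ID with getElementById, then regex a specific value out of its text. The markup, the id "price", and the price format are all made up for illustration.

```php
<?php
// Hypothetical markup; the id "price" and the text layout are invented for this example.
$html = '<html><body>
<div id="nav"><a href="/">Home</a> $1.00 off</div>
<div id="price">Sale price: <b>$19.99</b> (was $24.99)</div>
</body></html>';

libxml_use_internal_errors(true); // keep warnings about invalid markup quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Narrow the search to a single element first...
$div = $dom->getElementById('price');

// ...then run the regex on that element's text only.
// textContent flattens nested tags, so the <b> wrapper does not get in the way.
$price = null;
if ($div !== null && preg_match('/\$(\d+\.\d{2})/', $div->textContent, $m)) {
    $price = $m[1];
}

echo $price, "\n";
```

In this sample the same pattern run against the whole document would hit the "$1.00" in the nav div first, so scoping the regex to one element changes which value comes back. Whether that is worth the extra parsing step is the judgment call being argued in the comments.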