Working with DOM in PHP – Looking at a PHP HTML Parser
So, let’s assume you’ve got a PHP project where you’re scraping pages and trying to parse fields out of the DOM. Up till now, I’ve just used regular expressions because they’re easy. I avoided trying to parse HTML as XML using SimpleXML because there are just too many cases where it would fail due to invalid tags.
Well, I feel like an idiot. It turns out there’s a great extension built into PHP to do just that, and it’s the DOM extension. Using this, parsing HTML with PHP is just as easy as accessing the DOM using jQuery (hint: very easy).
Let’s say we’ve got a page sitting on our local drive already. For this example, I’ll use the homepage of this blog. We’re going to parse out all the links. I’ve saved the page as index.html and in the same directory I’ve created the parser script.
<?php
$dom = new DOMDocument;

// you can use loadHTML if you already have your string in memory
$dom->loadHTMLFile("index.html");
$dom->preserveWhiteSpace = false;

// grab all the A tags
// returns a DOMNodeList
$tags = $dom->getElementsByTagName('a');

// you can actually iterate over the tags returned -
// I'm not sure why they don't say that more explicitly
// (length is already an integer, so there is no need to count() it)
echo "Total length: " . $tags->length . "\n";

foreach ($tags as $t) {
    // each of these is a DOMElement object
    // the value is what's inside the tag
    // the attributes can also be accessed
    printf("%-50s%s \n", $t->nodeValue, $t->getAttribute('href'));
}
Here’s a glimpse of the output:
vim                                               http://www.rustyrazorblade.com/category/vim/
virtual box                                       http://www.rustyrazorblade.com/category/virtual-box/
vmware                                            http://www.rustyrazorblade.com/category/vmware/
weird                                             http://www.rustyrazorblade.com/category/weird/
wikipedia                                         http://www.rustyrazorblade.com/category/wikipedia/
windows                                           http://www.rustyrazorblade.com/category/windows/
xcode                                             http://www.rustyrazorblade.com/category/xcode/
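If the markup is already in a string, the same approach works with loadHTML(). Here’s a minimal sketch of that, not part of the script above: the extractLinks() helper and the sample markup are made up for illustration. It also uses libxml_use_internal_errors() so that the warnings libxml raises on invalid markup are collected quietly instead of being printed.

```php
<?php
// Hypothetical helper (not from the original script): returns href => text
// pairs for every <a> tag found in an HTML string.
function extractLinks(string $html): array
{
    // Collect libxml's parse warnings instead of emitting them
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument;
    $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $links[$a->getAttribute('href')] = trim($a->nodeValue);
    }
    return $links;
}

// Sample markup invented for illustration; note the unclosed <p> and <b> tags
$html = '<html><body><p>Some <b>sloppy markup'
      . '<a href="/category/vim/">vim</a>'
      . '<a href="/category/php/">php</a></body></html>';

foreach (extractLinks($html) as $href => $text) {
    printf("%-20s%s\n", $text, $href);
}
```

Keying the array by href means duplicate links collapse into one entry; if you need every occurrence, push plain pairs onto the array instead.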
Here’s another great reference I originally used to get started:
You can take this a bit further if you want to use the PHP cURL extension to fetch pages instead of reading them from disk. Additionally, if you’re interested in using the advanced curl_multi_exec functionality, check out my previous post.
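As a rough sketch of what that could look like, the snippet below fetches a page with cURL and hands the body to the DOM extension. The fetchPage() and linkTargets() helpers, the timeout value and the example URL are all assumptions made up for this illustration, and error handling is kept to a bare minimum.

```php
<?php
// Hypothetical helper: fetches a URL with the cURL extension and returns
// the response body, or null if the request fails.
function fetchPage(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow redirects
        CURLOPT_TIMEOUT        => 10,    // seconds; arbitrary value for this sketch
    ]);
    $body = curl_exec($ch);

    return $body === false ? null : $body;
}

// Hypothetical helper: returns the href of every <a> tag in an HTML string.
function linkTargets(string $html): array
{
    if ($html === '') {
        return [];  // loadHTML() rejects an empty string
    }

    // Collect libxml's parse warnings instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $hrefs[] = $a->getAttribute('href');
    }
    return $hrefs;
}

// Usage (placeholder URL, left commented out so nothing hits the network):
// $html = fetchPage('http://example.com/');
// if ($html !== null) {
//     print_r(linkTargets($html));
// }
```

Keeping the fetch and the parse in separate functions means the parsing half can be tried against a saved file or a string without making a request.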
Edit: cynope on reddit suggested phpQuery. I haven’t used it yet but it looks pretty cool. If I get a chance to try it I’ll post a follow-up.
7 Responses to Working with DOM in PHP – Looking at a PHP HTML Parser
Yes, there are “easier” ways to parse HTML in PHP. However, I’ve found that I almost always end up using regex. Maybe it’s just me, the type of projects or the source, but DOM traversing was never good enough.
The nice part of the DOM parser is that it can handle invalid HTML, as well as correctly parse out the attributes from a tag. I wrote that script in about 5 minutes vs the trial and error of using a bunch of regexes.
Usually, I need specific pieces of data from within each DOM element or attribute, so I would end up with regex anyway. In those cases I just go straight to regex and dump any additional extensions and parsers. Also, maybe it’s my background in Perl, but regex has never been a long, painstaking task.
Don’t get me wrong. For things like grabbing all links or images from an HTML source, using what you suggest is a great way to do it and very easy.
Ming – I’m not sure I follow though. If you check the example, you can easily grab any of the attributes or values. (note the href pulled out using getAttribute(‘href’) )
You can also do things like grab elements and their contents by ID (getElementById), then pull what you want out of that using the same techniques listed above.
I’d like to see an example where the regex is easier than the DOM parser.
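For instance, a lookup by ID could go something like the sketch below. The markup is invented for illustration; the only assumption beyond the post itself is that the document was loaded with loadHTML(), where id attributes are picked up automatically.

```php
<?php
// Sample markup invented for illustration
$html = '<html><body>'
      . '<div id="sidebar"><a href="/about/">About</a></div>'
      . '<div id="content"><h1>Hello</h1><a href="/post/1/">First post</a></div>'
      . '</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

// getElementById() returns a DOMElement, or null if no element has that id
$content = $dom->getElementById('content');

if ($content !== null) {
    // nodeValue is the concatenated text of everything inside the element
    echo $content->nodeValue, "\n";

    // The element can be searched the same way as the whole document,
    // so only links inside #content are listed here
    foreach ($content->getElementsByTagName('a') as $a) {
        echo $a->getAttribute('href'), "\n";
    }
}
```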
Jon,
Yes, if you are pulling the entire element and contents then you are right. Or even if you are going to do some basic split or string match, etc.
I’m speaking more about the case where you would grab the contents of an element and then regex a specific value out of it, e.g. grab the text within a div and then extract a specific value. In that case, where you would end up using regex anyway to get that value, I would probably just skip DOM and go straight to a regex over the entire doc.
Keep in mind that this does not address performance or elegance, just my personal preference and style.
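To make the scenario concrete, here is a small sketch of the combined approach: the DOM narrows things down to one element, and a regex pulls the value out of its text. The markup, the element IDs and the price pattern are all made up for illustration, and it assumes the value sits in an element that can be found by ID.

```php
<?php
// Sample markup invented for illustration
$html = '<html><body>'
      . '<div id="nav">Price list | 2 items</div>'
      . '<div id="product">Widget, now only $19.99 while stocks last</div>'
      . '</body></html>';

$dom = new DOMDocument;
$dom->loadHTML($html);

$price = null;
$product = $dom->getElementById('product');

// Narrow down to one element with the DOM, then extract the value with a regex
if ($product !== null && preg_match('/\$(\d+\.\d{2})/', $product->nodeValue, $m)) {
    $price = $m[1];
}

echo $price ?? 'no price found', "\n";
```

Running the regex on one element’s text rather than the whole document keeps the pattern simple, since it doesn’t have to account for the surrounding tags.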
Look up SimpleHTMLDOM. I use that all the time. It is a bit sluggish and occasionally awkward to use but I’ve rarely seen it fail to parse even the most evil HTML – such as Microsoft Word’s idea of “HTML”. Selecting DOM elements is flawless. It has difficulties with modifying the DOM though – I end up having to call save() and then load() every time I make any change before making more modifications or it gets confused.
phpquery looks interesting for those scenarios where I’m modifying the DOM frequently.
I feel so much happier now that I understand all this. Thanks!