The other day I needed to do some HTML scraping: trim out some repeated data stuck inside nested divs and produce a simplified array of it. My first port of call was SimpleXML, which I have used many times. This time, however, the son of a bitch just wouldn’t work with me and kept throwing parsing errors. I lost patience with it and decided to give DOMDocument and DOMXPath a go, which I’d heard of but never used.
All of the examples I could find showed basic parsing of HTML, e.g. grabbing all the a elements on a page, or grabbing all the child book elements in a bookstore XML document. I needed to grab all the divs with a certain class, then further parse their HTML to pull out bits of data. The HTML I’m working with for this example is a set of div elements with the class foo, each containing an img element and a p element holding a description.
Using DOMXPath::query to extract HTML data
I want an array of the image src attribute values along with the text of the p elements. The basic method is to query for the div.foo elements using DOMXPath::query, which returns a DOMNodeList. Iterating over the list yields a DOMNode on each iteration. Once we have the DOMNode for a div.foo element, we query again using DOMXPath::query, but crucially pass the node as the second parameter, $contextnode. This makes the query relative to the div.foo node and lets us grab the child img element’s src attribute as well as the description from the p element.
Here’s the test script:
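What follows is a minimal sketch of such a script rather than a verbatim listing. It assumes sample markup with two div.foo blocks, each wrapping an img and a p. Only the foo class, the img and the p come from the description above. The file names, the descriptions, the extra div.bar block and the array keys src and description are made up for illustration.

```php
<?php
// Sample markup assumed for illustration: the file names, descriptions
// and the div.bar block are invented, not taken from a real page.
$html = <<<'HTML'
<div id="wrapper">
    <div class="foo">
        <img src="cat.jpg" alt="">
        <p>A sleepy cat</p>
    </div>
    <div class="bar">
        <img src="ignored.jpg" alt="">
        <p>Not a foo block</p>
    </div>
    <div class="foo">
        <img src="dog.jpg" alt="">
        <p>A happy dog</p>
    </div>
</div>
HTML;

// Collect libxml parse warnings internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = [];

// First query: every div whose class attribute is exactly "foo".
foreach ($xpath->query('//div[@class="foo"]') as $fooNode) {
    // Second query: passing $fooNode as the context node makes these
    // relative paths resolve against this div only.
    $img = $xpath->query('img', $fooNode)->item(0);
    $p   = $xpath->query('p', $fooNode)->item(0);

    $items[] = [
        'src'         => $img instanceof DOMElement ? $img->getAttribute('src') : null,
        'description' => $p instanceof DOMElement ? trim($p->textContent) : null,
    ];
}

print_r($items);
```

The inner expressions are deliberately relative (img and p, with no leading slashes). An expression starting with // would search the whole document again even when a context node is passed, so the div.bar image would leak into the results.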
Which gives the desired result: an array with one entry per div.foo element, each holding the image’s src value and the text of the p element.