Scraping Tricks: Nodes between other nodes

When writing scrapers, I often need to find all the elements between two others: for example, all the `h3` tags between a named `h2` and the next one. So, given the following structure:

<h2>Heading 1</h2>
  <h3>Item 1</h3>
  <h3>Item 2</h3>
  <h3>Item 3</h3>
<h2>Heading 2</h2>
  <h3>Item 4</h3>
  <h3>Item 5</h3>
  <h3>Item 6</h3>
<h2>Heading 3</h2>
  <h3>Item 7</h3>
  <h3>Item 8</h3>
  <h3>Item 9</h3>

I want to find all `h3`s that come between “Heading 2” and the next `h2`.

Finding the relevant `h2` is trivial:

node = xpath('.//h2[text()="Heading 2"]')

But the layout above is slightly misleading: the indentation makes it look as though the `h3`s are children of the `h2`, when they’re actually siblings. And `following-sibling::h3` will return all the following `h3` nodes (down to Item 9), not just the ones down to Item 6.

If you know the structure well enough to say that you want exactly 3 `h3` nodes, then it’s also easy, but mostly I don’t have that information, or it’s data that will change regularly.

I’ve tried multiple different approaches to this in the past, none of which were particularly appealing, and explanations of how to do this in pure XPath tend to be impressively convoluted. See, for example, this StackOverflow answer.

But today I found an approach that I think is quite neat:

node.xpath('following-sibling::h2 | following-sibling::h3').slice_before { |e| e.name == 'h2' }.first

This tells XPath to find all the following `h2` and `h3` nodes, and then uses Ruby’s `slice_before` method to split the list at each `h2`. I’d actually never seen this method before, and the documentation for it is a little cryptic, but essentially it’s a way of chunking a list based on the given test: a new sub-list starts at every element for which the block returns true, so the nodes before the next `h2` end up in the first sub-list. The `.first` then grabs that first list (the `h3`s before the next `h2`), ready for further processing.
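Since `slice_before` is plain `Enumerable`, its behaviour can be checked without a parser at all. The sketch below uses a hypothetical `Node` struct as a stand-in for parsed elements (it is not part of any library) and feeds it the siblings that follow “Heading 2” in the sample above. It also pokes at one edge case that seems worth knowing about: if the heading is immediately followed by another `h2`, the first chunk starts with that `h2`, so `.first` hands back the next section instead of an empty list. `take_while` is shown as one possible alternative for that case; it is a swapped-in technique, not the approach described above.

```ruby
# Hypothetical stand-in for a parsed element; only tag name and text matter here.
Node = Struct.new(:name, :text)

# The siblings that follow "Heading 2", in document order.
siblings = [
  Node.new('h3', 'Item 4'),
  Node.new('h3', 'Item 5'),
  Node.new('h3', 'Item 6'),
  Node.new('h2', 'Heading 3'),
  Node.new('h3', 'Item 7'),
  Node.new('h3', 'Item 8'),
  Node.new('h3', 'Item 9'),
]

# slice_before starts a new chunk at every element for which the block is true.
chunks = siblings.slice_before { |e| e.name == 'h2' }.to_a
chunk_texts = chunks.map { |chunk| chunk.map(&:text) }

# The first chunk holds everything before the next h2.
wanted = chunks.first.map(&:text)

# Edge case: a section with no h3s, so the very first sibling is an h2.
empty_section = [Node.new('h2', 'Heading 3'), Node.new('h3', 'Item 7')]
first_chunk = empty_section.slice_before { |e| e.name == 'h2' }.first.map(&:text)

# take_while stops at the first h2, so it returns an empty list here.
guarded = empty_section.take_while { |e| e.name != 'h2' }.map(&:text)

puts wanted.inspect
puts first_chunk.inspect
puts guarded.inspect
```

If empty sections can turn up in the data you are scraping, checking whether the first chunk begins with an `h2` (or using `take_while`) avoids picking up the wrong section.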
