Scraping Tricks: Obfuscated ‘cfemail’ addresses

Many websites try to obfuscate any email addresses they display, so that spammers don’t harvest them.

The utility of this is debatable — crawlers that look for email addresses have mostly evolved to be able to find these anyway — but it’s annoying for those of us scraping the details for other purposes.

My two default options for dealing with this are to either use Ruby’s ExecJS library (to run relatively simple chunks of JS), or use a PhantomJS-based scraper for more complex cases.

Today, I had an example (when scraping details of the Guatemalan Assembly) where I should have been able to use ExecJS, but it was complaining about unbalanced brackets. After a few minutes of trying to work out where the problem was, I realised the de-obfuscation code was itself relatively simple to just reimplement in Ruby.

In JavaScript it looks like this:

<script cf-hash='f9e31' type="text/javascript">
/* <![CDATA[ */!function(){try{var t="currentScript"in document?document.currentScript:function(){for(var
t=document.getElementsByTagName("script"),e=t.length;e--;)if(t[e].getAttribute("cf-hash"))return t[e]}
();if(t&&t.previousSibling){var e,r,n,i,c=t.previousSibling,a=c.getAttribute("data-cfemail");if(a)
{for(e="",r=parseInt(a.substr(0,2),16),n=2;a.length-n;n+=2)
i=parseInt(a.substr(n,2),16)^r,e+=String.fromCharCode(i);e=document.createTextNode(e),c.parentNode.replaceChild(e,c)
}}}catch(u){}}();/* ]]> */</script>

Most of that, however, is simply getting the data to be processed, and then injecting the result back into the displayed document. The actual de-obfuscation code is just:

for (e = "", r = parseInt(a.substr(0,2), 16), n = 2; a.length - n; n += 2) {
  i = parseInt(a.substr(n, 2), 16) ^ r
  e+=String.fromCharCode(i)
}

This takes the ‘data-cfemail’ value, which looks like ‘f194909b929881b1929e9f968394829edf969e93df9685’ and can be extracted with a Nokogiri search like noko.css('a.__cf_email__/@data-cfemail'), and treats it as consecutive pairs of hex digits. The first pair is the key: each pair after it is XOR-ed with that key to get the character code of the next character of the address.

Or, in Ruby:

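def parse_cfemail(str)
  # Split into two-character hex pairs and convert each one to an integer
  list = str.scan(/../).map { |pair| pair.to_i(16) }
  # The first value is the XOR key for everything after it
  key = list.shift
  list.map { |i| (key ^ i).chr }.join
end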

You can see it in action in the final scraper.
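
As a quick sanity check on the decoding described above, here is a minimal sketch that runs the function against the sample value from this post and against a made-up fixture. The encode_cfemail helper is hypothetical: it is not part of the original page's script or of the scraper, only my own inverse of the XOR scheme (with an arbitrary default key) so that a round trip can be tested without hitting a live page.

```ruby
# Decoder as described in the post: the first hex pair is the XOR key,
# and every later pair is XOR-ed with it to recover one character.
def parse_cfemail(str)
  list = str.scan(/../).map { |pair| pair.to_i(16) }
  key = list.shift
  list.map { |i| (key ^ i).chr }.join
end

# Hypothetical inverse, for building test fixtures only (an assumption,
# not something the obfuscating site exposes). The key is arbitrary.
def encode_cfemail(address, key = 0x42)
  ([key] + address.bytes.map { |byte| byte ^ key })
    .map { |byte| format('%02x', byte) }
    .join
end

# The sample value quoted earlier in the post.
sample = 'f194909b929881b1929e9f968394829edf969e93df9685'
decoded = parse_cfemail(sample)
puts decoded.end_with?('@congreso.gob.gt') # prints "true"

# Round trip with a made-up address.
fixture = encode_cfemail('user@example.com')
puts parse_cfemail(fixture) == 'user@example.com' # prints "true"
```

Because XOR is its own inverse, encoding and decoding are the same operation apart from where the key byte sits, which is why the round trip needs so little code.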
