My Scraping Stack
Working with data sources from the web still has a lot to do with scraping. And I assume it will for a long time to come. Even if the data you need is available via a web service API, harvesting larger amounts of data usually amounts to something like scraping. Again and again you will have to answer the same questions.
- What language should I use to write my scraper?
- Which HTTP client should I use?
- How can I parse HTML input, maybe XML, JSON and all the rest?
- How can I scrape pages that require form submission, session handling etc.?
- How can I store the data?
- How can I avoid scraping the same stuff over and over again?
- How can I make the scraper fault-tolerant, so that it won’t break on a single 404 response?
If you’re simply gathering content from one HTML table, you probably won’t mind. But as your projects grow larger, these questions will become more important.
I have spent a good share of the past two years writing, rewriting and maintaining several scrapers. One of them, called Scrape-A-RIS, forms the foundation for an Open Data version of the local parliament platform of Cologne, Germany. So I have had plenty of occasion to think about these questions. I thought it was time to discuss the current state of my answers here in public, so that someone could benefit from what I learned, or call “bollocks”.
The language
There isn’t much to argue here. You have your scripting language of choice, I have mine. All I’ve regretted since working with Python is the fact that I spent so much time in my past with PHP. But let’s not get religious here. This post isn’t about any particular language, but Python is the foundation of my stack. If you’re not into Python, this article might either not be useful to you, or you might take it as an occasion to have a first look.
The HTTP client
Python modules like urllib, urllib2 (both in the standard library) and the third-party urllib3 all give you some way to request a resource from an HTTP server. The problem: they’re a mess.
My choice is the Requests module. It has a straightforward API that I can actually remember how to use, and it gives me a great amount of control without forcing me to think about all of it when all I want is a simple GET request. It also treats HTTP headers as case-insensitive (so I don’t have to care about their casing), and more.
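To illustrate, here is a minimal sketch of what a Requests call looks like and what the case-insensitive headers mean in practice. The URL is just a placeholder, not from any of my projects:

```python
import requests
from requests.structures import CaseInsensitiveDict

# A plain GET is a one-liner (placeholder URL):
#   response = requests.get("http://example.com/page.html")
#   response.status_code, response.text, response.headers

# Requests exposes response.headers as a CaseInsensitiveDict,
# so lookups work no matter how the server cased the header names:
headers = CaseInsensitiveDict({"Content-Type": "text/html; charset=utf-8"})
print(headers["content-type"])  # same value for any casing
```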
In some of my projects there is still one exception to the rule, where I use mechanize. We’ll get to that.
Parsing HTML
Writing code to parse HTML is the worst thing about the whole concept of scraping. You can spend (and waste) a lot of time here. And you can get really philosophical about how it should be done right. For example, you can think a lot about how to make your parsing code long-lasting and resilient to change.
I’ve tried several approaches. For example, scrapemark looked nice in the beginning. It lets you describe HTML structures with very little detail, which should mean that if tiny details of the HTML change, you won’t have to adapt the scraper. However, scrapemark gave me completely unexpected results in some cases. It just didn’t behave as expected, and when it did so, it was impossible to debug. So I finally dropped it.
Currently I use the lxml Python module with straightforward XPath for parsing. This seems to be the right tool for all kinds of (X)HTML input. Be it broken legacy HTML or well-formed XML, with lxml there is a way to make a tree out of it. Here is some sample code that should digest both:
>>> from lxml import etree
>>> html = "<p>This is bad HTML.<p>And more."
>>> tree = etree.HTML(html)
>>> for el in tree.xpath("//p"):
...     print(el.tag, el.text)
...
p This is bad HTML.
p And more.
At first I wasn’t too keen on XPath as a medium of expression. lxml gives you the option to use CSS-style selectors to create XPath strings; the corresponding module is called lxml.cssselect. But it’s not a drop-in replacement for XPath and makes the code more complicated. Still, if you’re comfortable with CSS but not with XPath, you can use it to generate XPath expressions for use in your parsing code.
Parsing HTML in digestible chunks
Often the HTML markup makes it impossible (or not sensible) to address an element on the page directly. Say you are targeting a table that is preceded by five other tables. Would you hard-code the index number 6 of your target table into your code?
In cases like this, I often end up iterating over the top-level blocks until I know that I’m in the right one. Then I apply more detailed parsing to that block. For this example, I would probably iterate over all tables (using the appropriate XPath) and check for some condition(s) until I know that the table at hand is the one with the expected format. Then I apply some more XPath and iterate over the rows of that table in order to extract the data.
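As a sketch of that pattern (the markup and the header names here are made up for illustration):

```python
from lxml import etree

# Made-up page: a navigation table first, then the data table we want.
html = """
<html><body>
  <table><tr><td>navigation stuff</td></tr></table>
  <table>
    <tr><th>Date</th><th>Title</th></tr>
    <tr><td>2013-01-15</td><td>Budget session</td></tr>
  </table>
</body></html>
"""

tree = etree.HTML(html)
rows = []
# Iterate over all tables instead of hard-coding an index ...
for table in tree.xpath("//table"):
    # ... and check a condition until we know it's the right one.
    if [th.text for th in table.xpath(".//th")] != ["Date", "Title"]:
        continue
    # Now apply more detailed parsing to this table's data rows.
    for row in table.xpath(".//tr[td]"):
        rows.append([td.text for td in row.xpath("./td")])

print(rows)  # [['2013-01-15', 'Budget session']]
```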
Submitting forms, handling sessions
I can’t tell why, but it appears as if mechanize was the only beast that handles even the worst of these cases. Unfortunately, the mechanize documentation is horrible and the API seems to make no sense at all – quite the opposite of Requests.
Here is a working example that shows how to download a number of PDFs from a page, especially adapted for that case:
>>> import mechanize
>>> url = "http://buergerinfo.mannheim.de/buergerinfo/to0040.asp?__ksinr=4626"
>>> response1 = mechanize.urlopen(mechanize.Request(url))
>>> forms = mechanize.ParseResponse(response1, backwards_compat=False)
>>> for form in forms:
...     for control in form.controls:
...         if control.name == 'DT':
...             request2 = form.click()
...             response2 = mechanize.urlopen(request2)
...             form_url = response2.geturl()
...             if "getfile.asp" in form_url:
...                 document = response2.read()
...                 # do something with that document
In this example, we iterate over all forms on that page in order to check each for the existence of a certain input field named “DT”. If it’s there, we submit that form and handle the response. See how the ParseResponse function intuitively returns a list of all forms within a page? Not. Awkward as it is, in some cases where everything else fails it seems worthwhile to check out the sparse mechanize documentation and even the source code.
Storing the data
Among the approaches I’ve employed are: writing data to relational database tables (MySQL, mostly), writing JSON files, writing files like PDFs to the file system, and, last but not least, using a NoSQL database.
I tend to go with the NoSQL approach right away now, and I’m currently mostly happy with MongoDB. For me, it combines many benefits of the other approaches while avoiding some of their problems.
But let’s step back and start with JSON files. Writing each scraped “record” to an individual JSON file is easy, and it allows you not to worry too much about the schema. In practice, depending on the complexity of the data to be scraped, defining a schema can become a big issue. When you start scraping, you might not yet know what the 1000th record will look like, so being able to just write everything away is a good thing. On the other hand, thousands or hundreds of thousands of (small) files are not easily handled. Once you start to actually use (= read) your data for analysis, you will regret all that file opening, reading and parsing. Let alone manipulation.
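A quick sketch of both sides of that trade-off; the file names and record contents are made up:

```python
import json
import os
import tempfile

# Two made-up scraped records; note they don't even share a schema.
records = [
    {"id": 1, "title": "Budget session", "date": "2013-01-15"},
    {"id": 2, "title": "Planning committee"},
]

outdir = tempfile.mkdtemp()

# Writing is trivial: one JSON file per record, no schema required.
for record in records:
    with open(os.path.join(outdir, "%d.json" % record["id"]), "w") as f:
        json.dump(record, f)

# Reading it all back means opening and parsing every single file;
# this is what gets painful with hundreds of thousands of records.
loaded = []
for name in sorted(os.listdir(outdir)):
    with open(os.path.join(outdir, name)) as f:
        loaded.append(json.load(f))

print(len(loaded))  # 2
```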
Storing to a relational database might be the way to go for simple tabular data. But once you deal with data that has to be broken down into several tables with records linked by foreign keys, things become nasty. Not only do you have to be completely aware of the structure of the data at hand (which is, as I said, not always possible), you also have to make sure foreign key constraints are met during write and update transactions.
And then there is MongoDB. MongoDB doesn’t care about the schema of your data. You can store practically everything as a “document” that fits inside a Python dict. When I start a scraper project nowadays, I tend to store the roughest form of the parsed data in MongoDB, without worrying too much about date formats etc. – that can be done later, once I find out that I actually want to work with that attribute. Since it’s effortless to manipulate documents once they are stored in MongoDB, I find myself iteratively writing clean-up scripts with simple routines that can then, step by step, be moved into the scraper code. This way I can do first explorations of the data early on and don’t have to re-scrape everything when I change my mind about the schema. It’s an iterative approach, really.
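Here is a sketch of that iterative clean-up idea, using a plain list of dicts to stand in for a MongoDB collection; the field names and date format are made up:

```python
from datetime import datetime

# Stand-in for a MongoDB collection: in reality these would be
# documents stored and updated via pymongo, but the idea is the same.
collection = [
    {"_id": 1, "title": "Budget session", "date": "15.01.2013"},
    {"_id": 2, "title": "Planning committee", "date": "22.01.2013"},
]

# Later clean-up pass: normalize the raw date strings in place.
# Once a routine like this proves itself, it can move into the scraper.
def clean_dates(docs):
    for doc in docs:
        if isinstance(doc.get("date"), str):
            doc["date"] = datetime.strptime(doc["date"], "%d.%m.%Y")

clean_dates(collection)
print(collection[0]["date"].year)  # 2013
```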
For the curious: To access MongoDB data, I use pymongo and no additional layer on top.
The job queue
The two remaining questions from the list at the top (avoid duplicate scraping, be fault-tolerant) have led me to employ a job queue for my larger scrapers. Let me explain.
Imagine you have a scraper that has to recursively work through thousands of search result entries on a website. Then, after hours of scraping, a network outage makes one request throw an exception that kills your code. Or, even nastier, a piece of your parser produces an error on a detail page. You fix the code, and then what? Scrape everything again?
There might be many cases where this would actually work, e.g. if you have a durable key for the resources you scrape (say an ID or a permanent URL). Then, while working through a list of items on a page, you can look up that key in your database and only scrape those items you haven’t fetched yet.
But what if your scraper project isn’t a one-off, but should also update existing records in case the data on the website has changed? In that case, my recommendation is to build a persistent job queue. The scraper’s first job is to fill the queue with keys of items (say URLs) to be scraped – or you might even add some manually. Once there are jobs in the queue, your scraper reads the open jobs one by one and scrapes the corresponding entries. It’s advisable to mark a job as “in progress” when the scraper takes it over. Then, after an entry has been scraped, the job status is changed to “done”. And so forth, until no open jobs remain. While scraping one entry, the scraper might find a reason to add more jobs to the queue, in order to fetch previously unseen items that are mentioned (linked) on a page. No problem.
All you have to ensure is that there can only be one job per target resource in the queue. Since I use MongoDB as the storage for the scraped data, it feels natural to me to maintain the queue in MongoDB, too. The uniqueness of a job entry can easily be guaranteed by an index with the “unique” setting.
So what happens if something kills the scraper while it works off the queue? There might be a number of open and finished jobs in the queue, as well as one “in progress” job. You can set up your scraper so that, at startup, it resets “in progress” jobs to “open” again. Then all you need to do is start working off the queue once more.
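Here is a deliberately simplified, in-memory sketch of that queue logic. The real version keeps the jobs in a MongoDB collection (with a unique index on the URL), but the status handling is the same; the URLs are placeholders:

```python
# url -> status; a dict key can only exist once, which stands in for
# the unique index MongoDB would enforce on the job collection.
queue = {}

def add_job(url):
    # setdefault ignores duplicates: one job per target resource
    queue.setdefault(url, "open")

def next_job():
    for url, status in queue.items():
        if status == "open":
            queue[url] = "in progress"
            return url
    return None  # no open jobs left

def reset_stale_jobs():
    # Run at startup: jobs left "in progress" by a crashed scraper
    # become "open" again and will simply be retried.
    for url, status in queue.items():
        if status == "in progress":
            queue[url] = "open"

add_job("http://example.com/item/1")
add_job("http://example.com/item/1")  # duplicate, silently ignored
add_job("http://example.com/item/2")

url = next_job()       # item/1 is now "in progress"
queue[url] = "done"    # ... scraped successfully
next_job()             # item/2 is "in progress" when we "crash"

reset_stale_jobs()     # on restart: item/2 is "open" again
```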
After your scraper has finished and all jobs are done, it’s your responsibility to empty the queue.
If you’re interested in sample code for the queue, I have set up a gist with an (untested) simplified version of a queue I use in one of my projects.
Wrapping it up
This is pretty much it. I hope you either find this useful or can correct me where I’m wrong. In either case, feel free to add comments!