Thursday, January 22, 2009

Screen Scraping

I recently spent an evening in the company of an old colleague and friend, John G, and we discussed the pros & cons of screen scraping. "What's that then, Hayden?" I hear various people asking.

Screen scraping is the process of electronically grabbing content from an interface designed for human viewing. In the pre-web days it was used as a way of getting displayed system information from terminals. Now it generally refers to the technique of grabbing the HTML on a web page and inserting that content into a file or database for subsequent use.
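To make that concrete, here is a minimal sketch in Python of what "grabbing the HTML" can look like. It uses only the standard library's html.parser module. The sample page, the TableScraper class and the scrape_table function are made up for illustration; a real routine would first download the page and would need to match that site's actual markup.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of cell text."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page content, standing in for HTML fetched from a website.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Widget</td><td>4.99</td></tr>
  <tr><td>Gadget</td><td>12.50</td></tr>
</table>
</body></html>
"""

rows = scrape_table(SAMPLE_PAGE)
```

The resulting rows can then be written to a file or database for subsequent use.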

So, what are the pros?
Well, by running a screen scraping routine, you can obtain data from a website that you would otherwise have to manually copy & paste into another destination (e.g. a spreadsheet). This routine could be automated to run at a particular time (e.g. just after the site is updated at midnight) and may save you having to integrate with the site directly or pay the site owner for an export of the content you need.
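As a rough sketch of the "spreadsheet" half of that, the snippet below turns some already-scraped rows into CSV text that a spreadsheet can open, using Python's standard csv module. The rows_to_csv function and the sample rows are invented for illustration and do not come from any real site.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialise a list of rows (lists of strings) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical rows, as a scraping routine might have produced them.
scraped_rows = [
    ["Item", "Price"],
    ["Widget", "4.99"],
    ["Gadget, large", "12.50"],
]

csv_text = rows_to_csv(scraped_rows)
```

Running this at a particular time is best left to the operating system's scheduler (cron on Unix-like systems, for example) rather than built into the script itself.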

But what are the cons?
Well, firstly it's rather underhand. Yes, it is just automating a manual process you may well be doing anyway, but the question should be raised as to why you need to obtain lots of information from the original source in this way (and presumably without the owner's permission). The terms & conditions of many sites will prohibit you from doing this, especially if you have to register or pay to browse premium information that you then want to scrape. It should also be noted that you are obtaining information from a website with a known layout and code structure: any change to that code may mean your routine no longer works (and some websites deliberately make such changes for that very reason).
In addition, some sites will be very quick to notice screen scraping, especially if it is likely to affect their revenue or purpose. They could then block your access at the network level (for example, by refusing requests from your IP address) and counter your efforts.
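The point about layout changes is easy to demonstrate. In this small Python sketch the scraper looks for a particular class name in the markup; the pages, the class names and the scrape_prices function are all hypothetical. When the site renames the class, nothing crashes. The routine just quietly stops finding anything.

```python
import re

# The scraper is tied to one exact piece of markup on the page.
PRICE_PATTERN = re.compile(r'<span class="price">([^<]+)</span>')


def scrape_prices(html):
    """Return the text of every <span class="price"> element in the HTML."""
    return PRICE_PATTERN.findall(html)


# Hypothetical page before and after the site owner renames a CSS class.
OLD_PAGE = '<div><span class="price">4.99</span><span class="price">12.50</span></div>'
NEW_PAGE = '<div><span class="cost">4.99</span><span class="cost">12.50</span></div>'

old_result = scrape_prices(OLD_PAGE)  # finds both prices
new_result = scrape_prices(NEW_PAGE)  # finds nothing, and raises no error
```

Because the failure is silent, an automated routine would be wise to treat an empty result as a warning sign and not as "no data today".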

To quote (without his permission) Eric Raymond of the Jargon File:

"screen-scraping is an ugly, ad-hoc, last-resort technique that is very likely to break on even minor changes to the format of the data being snooped."