Thursday, August 4, 2011

Screen scraping: not dead, just renamed

A few years back I blogged about screen scraping and why it was the wrong thing to do.

Since that time I've seen clients (primarily eCommerce ones) develop feeds for:



  • Affiliates
  • Google product search
  • Third-party applications (e.g. mobile apps)
  • Other consumers along the same lines


This, however, means that a decent website can quickly end up with a number of different feeds that all do different things. The result is a mesh of XML files and APIs that becomes quite complex to maintain and manage.

One approach is to create a single feed that is used for everything, perhaps going via a marketing agency that reformats it for different purposes. However, this can turn out to be a pretty bulky file (with a large catalogue it can quickly reach several megabytes), and it can contain details that you might not want every party to have (e.g. links to the hi-res images on your Content Delivery Network, which you may be paying for by the megabyte).
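As a rough illustration of that single-feed approach, here is a minimal Python sketch that cuts one master feed down to what each consumer is allowed to see. The element names, the consumer names and the build_feed helper are all made up for the example; a real feed would have its own schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical master feed; element names are illustrative only.
MASTER_FEED = """<products>
  <product>
    <sku>A100</sku>
    <name>Blue Widget</name>
    <price>9.99</price>
    <hires_image>https://cdn.example.com/a100-large.jpg</hires_image>
  </product>
</products>"""

# Hypothetical allow-lists: which fields each consumer may receive.
ALLOWED_FIELDS = {
    "affiliates": {"sku", "name", "price"},
    "mobile_app": {"sku", "name", "price", "hires_image"},
}


def build_feed(master_xml, consumer):
    """Return a copy of the master feed with only the consumer's fields."""
    allowed = ALLOWED_FIELDS[consumer]
    root = ET.fromstring(master_xml)
    for product in root.findall("product"):
        # Copy the child list first, since we remove while iterating.
        for field in list(product):
            if field.tag not in allowed:
                product.remove(field)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The affiliate feed should no longer carry the CDN image link.
    print(build_feed(MASTER_FEED, "affiliates"))
```

This keeps the paid-for image links out of the affiliate feed, but it does nothing about the size of the master file itself, and somebody still has to maintain the allow-lists.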

So I was reasonably interested in this article from eConsultancy, which seemed to address this very issue. Had they really found a decent solution to a problem that I think will only get worse as the needs of eCommerce sites grow?

Well the answer lies in this part of the posting:



Next-generation data feed solutions allow feeds to be generated and deployed
quickly and at low cost by extracting the ‘front end’ product-related HTML code
from the website, with no requirement for any ‘back end’ data – or expertise on
the part of the merchant. By harvesting elements such as pricing, availability
and product attributes directly from the merchant’s website, it is possible to
ensure that the extracted data feed is comprehensive and accurate


So let me get this straight. This 'next generation' method doesn't use an actual data feed from the site owner. It works by 'harvesting elements' from the HTML of the merchant's site, without any involvement from the merchant.

And that's not screen-scraping how exactly?
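
For anyone who hasn't seen it done, 'harvesting elements' from front-end HTML looks roughly like the Python sketch below. The markup, the class names and the scrape_product helper are invented for illustration; I have no idea how the vendors in the article actually implement it.

```python
from html.parser import HTMLParser

# Hypothetical product page markup; the class names are made up.
PAGE_HTML = """
<div class="product">
  <h1 class="product-name">Blue Widget</h1>
  <span class="price">9.99</span>
  <span class="availability">In stock</span>
</div>
"""


class ProductScraper(HTMLParser):
    """Collect the text of elements whose class is one we care about."""

    WANTED = {"product-name", "price", "availability"}

    def __init__(self):
        super().__init__()
        self.current = None  # class of the element we are inside, if wanted
        self.data = {}

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in self.WANTED:
            self.current = css_class

    def handle_data(self, text):
        # Store the first non-blank text found inside a wanted element.
        if self.current and text.strip():
            self.data[self.current] = text.strip()
            self.current = None

    def handle_endtag(self, tag):
        self.current = None


def scrape_product(html):
    scraper = ProductScraper()
    scraper.feed(html)
    return scraper.data


if __name__ == "__main__":
    print(scrape_product(PAGE_HTML))
```

Rename one class in the page template and the 'feed' silently loses a field, which is the fragility that gave screen scraping its reputation in the first place.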
