Tuesday, April 29, 2014

Advanced robots.txt techniques – Crawl Delay

We have a client for whom we have been carrying out search engine optimisation work over the last 3 months. Recently, however, we noticed that they had a bit of an issue with their robots.txt.

Now, for those who don’t know, a robots.txt file sits in the root directory of your website and tells search engines which files they should and should NOT index. (The only real exception to this rule is that some search engines also allow you to provide the address of your XML sitemap in this file.)

Note: listing a file or directory as excluded in your robots.txt file is no guarantee that those pages will not be indexed by search engines. It is merely a way to indicate that you don’t want the page appearing in Google, Bing/Yahoo, etc. … but it still might.
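For illustration, a more typical robots.txt might look something like this (the directory paths and sitemap URL here are hypothetical, purely for the example):

```
User-agent: *
Disallow: /admin/
Disallow: /checkout/

Sitemap: http://www.example.com/sitemap.xml
```

Each `Disallow` line asks crawlers to stay out of that path, and the `Sitemap` line is the optional extra mentioned above.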

Anyhow, back to this problematic little robots.txt file – here it is, in all its glory:
User-agent: *
Crawl-Delay: 10

The main issue here is with the directive “Crawl-Delay: 10”.

Note: This was put in automatically by the client’s eCommerce application and not manually by the client.
It is important to know that the “Crawl-Delay” directive does not control crawl rate in the sense of how many pages are fetched at once; instead, it defines the amount of time (typically from 1 to 30 seconds) that the search engine "bot" will wait between fetching each and every page of your site. This means that the higher the figure, the fewer pages of your site get crawled in any given period – and therefore the fewer that actually end up indexed.
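To see how a compliant crawler would interpret this exact file, you can parse it with Python's standard urllib.robotparser module. This is just a minimal sketch – each real bot applies its own rules, and as noted below, Google ignores Crawl-Delay entirely:

```python
from urllib.robotparser import RobotFileParser

# The problematic robots.txt, parsed from memory rather than fetched
robots_txt = """User-agent: *
Crawl-Delay: 10
"""

parser = RobotFileParser()
parser.modified()  # mark the file as "read" so queries below work
parser.parse(robots_txt.splitlines())

# Every page is still allowed – nothing is disallowed...
print(parser.can_fetch("AnyBot", "/some/page.html"))  # True

# ...but a compliant bot will pause 10 seconds between requests
delay = parser.crawl_delay("AnyBot")
print(delay)  # 10

# At one page per 10 seconds, a single bot can fetch at most
# 86,400 / 10 = 8,640 pages from the whole site per day
print(86400 // delay)  # 8640
```

That back-of-the-envelope figure (8,640 pages a day, at best) is why a large site with this directive can take a very long time to be crawled in full.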

The original purpose of this directive was to stop search engine spiders from tearing through a large site and hurting its performance (too many simultaneous requests can even bring a site to its knees). However, Google publicly states that it does not support the Crawl-Delay directive (although Bing does), so the case for having it there in the first place is significantly weakened.
Moreover, this crawl delay directive poses a problem for some search engine optimisation services, such as Raven Tools, which can’t effectively crawl sites when this single line is present.

So my advice… unless you have unusual, less common search engine spiders visiting frequently and having a noticeable effect on your site’s stability, I would remove this directive if at all possible.
