© Copyright July 18, 2005 Mike Banks Valentine

Search engine listing delays have come to be called the "Google Sandbox" effect, and those delays hold true in practice at each of the four top-tier search engines in one form or another. MSN, it seems, has the shortest indexing delay at 30 days. This article is the second in a series following the spiders through a brand new web site, beginning on May 11, 2005, when the site was first made live under a newly purchased domain name.
First Case Study Article
Previously we looked at the first 35 days and detailed the crawling behavior of Googlebot, Teoma, MSNbot and Slurp as they traversed the pages of this new site. We discovered that each robot spider displays distinctly different behavior in crawling frequency, and similarly differing indexing patterns.
For reference, about 15 to 20 new pages are added to the site daily, each linked from the home page for a day. Site structure is non-traditional, with no categories; the linking structure is tied to author pages listing their articles, along with a "related articles" index that varies from page to page by linking to relevant pages containing similar content.
So let's review where each spider stands, comparing pages crawled with pages indexed by each engine.
The AskJeeves spider, Teoma, has crawled most of the pages on the site, yet indexes no pages 60 days later at this writing. This is clearly a site-aging delay modeled on Google's Sandbox behavior. The Teoma spider from Ask.com has crawled more pages on this site than any other engine over the 60-day period, but it appears to have tired of crawling: it hasn't returned since July 13, its first break in 60 days.
In the first two days, Googlebot gobbled up 250 pages, didn't return until 60 days later, and has yet to index a single page since that initial crawl. But Googlebot has shown renewed interest in crawling the site since this crawling case study was published on several high-traffic sites. Now Googlebot is looking at a few pages each day, so far no more than about 20 pages, at a decidedly lackluster pace: a true "crawl" that would keep it occupied for years if continued that slowly.
MSNbot crawled timidly for the first 45 days, looking over 30 to 50 pages daily, but only after it found a robots.txt file, which we'd neglected to post to the site for a week. We then bobbled the ball again: when we changed the site structure, we failed to implement robots.txt on the new subdomains until day 25, and THEN MSNbot didn't return until day 30. If little else were discovered about initial crawls and indexing, we have seen that MSNbot relies heavily on that robots.txt file, and that proper implementation of the file will speed crawling. A bare-bones example follows below.
MSNbot is now crawling with enthusiasm at anywhere between 200 and 800 pages daily. As a matter of fact, we had to use a "crawl-delay" command in the robots.txt file after MSNbot began hitting 6 pages per second last week; a sample follows this paragraph. The MSN index now shows 4,905 pages 60 days into this experiment, and cached pages change weekly. MSNbot has apparently found that it likes how we changed the page structure to include a new feature which links to questions from several other article pages.
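Here is the sort of crawl-delay entry we're describing, aimed specifically at MSNbot. The 10-second figure is illustrative rather than the exact value from our file, and Crawl-delay is an unofficial directive that each engine honors a little differently:

    # Ask MSNbot to pause between requests (value in seconds)
    User-agent: msnbot
    Crawl-delay: 10

The directive asks MSNbot to wait the stated number of seconds between page requests, taming the 6-page-per-second bursts without shutting the spider out entirely.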