There has been endless webmaster speculation and worry about the so-called "Google Sandbox", an indexing delay for new domain names rumored to last at least 45 days from the date of first "discovery" by Googlebot. This recognized listing delay came to be called the "Google Sandbox effect."
Ruminations on the algorithmic elements of this sandbox delay have ranged widely since the indexing delay was first noticed in the spring of 2004. Some believe it comes down to a single element of good search engine optimization, such as linking campaigns. Link building has been the focus of most discussion, but others have pointed to the size of a new site, its internal linking structure, or simply a fixed time delay as the most relevant algorithmic elements.
Rather than contribute to this speculation and further muddy the Sandbox, we'll look at a case study of a site on a new domain name, established May 11, 2005, covering its specific site structure, submission activity, and external and internal linking. We'll see how this plays out in search engine spider activity vs. indexing dates at the top four search engines.
Ready? We'll give dates and crawler action in daily lists and see how this all plays out on this single new site over time.
* May 11, 2005 - Basic text on a large site posted on a newly purchased domain name, going live by day's end. Search-friendly structure implemented with text linking, making full discovery of all content possible by robots. Home page updated, with 10 new text content pages added daily. Submitted site at Google's "Add URL" submission page.
* May 12 - 14 - No visits by Slurp, MSNbot, Teoma or Googlebot. (Slurp is Yahoo's spider and Teoma is from Ask Jeeves.) Posted link on WebSite101 to new domain at Publish101.com.
* May 15 - Googlebot arrives and eagerly crawls 245 pages on the new domain after looking for, but not finding, a robots.txt file. Oooops! Gotta add that robots.txt file!
* May 16 - Googlebot returns for 5 more pages and stops. Slurp greedily gobbles 1480 pages and 1892 bad links! Those bad links were caused by our email masking, meant to keep out bad bots. How ironic that Slurp likes these.
* May 17 - Slurp finds 1409 more masking links & only 209 new content pages. MSNbot visits for the first time and asks for robots.txt 75 times during the day, but leaves when it finds that file missing! Finally get around to adding robots.txt by day's end to stop Slurp crawling email masking links and let MSNbot know it's safe to come in!
* May 23 - Teoma spider shows up for the first time and crawls 93 pages. Site gets slammed by BecomeBot, a spider that hits a page every 5 to 7 seconds and strains our resources with 2409 rapid-fire requests for pages. Added BecomeBot to robots.txt exclusion list to keep 'em out.
* May 24 - MSNbot has not shown up for a week, ever since finding the robots.txt file missing. Slurp is showing up every few hours, looking at robots.txt and leaving again without crawling anything now that it is excluded from the email masking links. BecomeBot appears to be honoring the robots.txt exclusion but asks for that file 109 times during the day. Teoma crawls 139 more pages.
* May 25 - We realize that we need to re-allocate server resources and rework the database design, and this requires changes to URLs, which means all previously crawled pages are now bad links! Implement subdomains and wonder, what now? Slurp shows up and finds thousands of new email masking links, as robots.txt was not moved to the new directory structures. Spiders are getting error pages upon new visits. Scrambling to put out fires after wide-ranging changes to the site, we miss this for a week. Spider action is spotty for 10 days until we fix robots.txt.
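
Much of the trouble in this log traces back to robots.txt: first missing, then left behind when the URLs changed. As a rough illustration only, here is a small Python sketch of how exclusion rules like the ones described could be checked before spiders find the gaps. The robots.txt contents, the `/mailmask/` path, and the sample page paths are all assumptions made up for this example; the site's real file and directory names are not given here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt. The rules and the /mailmask/ path are assumed
# for illustration and are not the site's actual file.
ROBOTS_TXT = """\
User-agent: BecomeBot
Disallow: /

User-agent: *
Disallow: /mailmask/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network request is made)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical (spider, path) pairs to check against the rules above.
checks = [
    ("BecomeBot", "/articles/page1.html"),
    ("Slurp", "/mailmask/contact-123"),
    ("Slurp", "/articles/page1.html"),
    ("Googlebot", "/articles/page1.html"),
]

# Map each pair to True (allowed) or False (blocked).
results = {pair: parser.can_fetch(*pair) for pair in checks}

for (agent, path), allowed in results.items():
    print(f"{agent:10} {path:28} {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules, BecomeBot is blocked everywhere, while every other spider is kept out of the masking links only. Since robots.txt is read separately for each host, a move to subdomains like the one on May 25 would mean repeating this sort of check against the file served by every new subdomain.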