How to Control Search Engine Robots
Wouldn't it be nice to be able to leave some code in your web site that tells search engine spiders to make your site number one? Unfortunately, a robots.txt file or robots meta tag won't do that, but they can help crawlers index your site better and keep unwanted crawlers out.
First, a few definitions:
Search Engine Spiders or Crawlers - A web crawler (also known as a web spider) is a program that browses the World Wide Web in a methodical, automated manner. Web crawlers are mainly used to create a copy of all visited pages for later processing by a search engine, which indexes the downloaded pages to provide fast searches.
A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit. As it visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, recursively browsing the Web according to a set of policies.
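The visiting loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PAGES table is a made-up, in-memory stand-in for the Web, and the crawl function and its max_pages limit are hypothetical names, not part of any real search engine.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would download each page and extract its hyperlinks instead.
PAGES = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seed_urls, max_pages=100):
    """Visit pages breadth-first, starting from the seed list."""
    to_visit = deque(seed_urls)  # the list of URLs still to visit
    seen = set(seed_urls)        # URLs already queued, so none is visited twice
    visited = []                 # URLs in the order they were visited

    while to_visit and len(visited) < max_pages:
        url = to_visit.popleft()
        visited.append(url)
        # Identify the hyperlinks on the page and queue the new ones.
        for link in PAGES.get(url, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


print(crawl(["http://example.com/"]))
```

The max_pages limit stands in for the "set of policies" a real crawler follows; actual policies also cover things like how often to revisit a page and how politely to load a server.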
Robots.txt - The robots exclusion standard, or robots.txt protocol, is a convention for keeping well-behaved web spiders and other web robots out of all or part of a website. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the website.
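Because the file always sits in the top-level directory, its address can be worked out from any page on the same site. A minimal Python sketch, assuming a made-up example URL and a hypothetical helper name robots_txt_url:

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the robots.txt location for the site that serves page_url."""
    # A leading "/" makes urljoin replace the whole path, keeping scheme and host.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://www.example.com/shop/item.html?id=7"))
# prints http://www.example.com/robots.txt
```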
The robots.txt protocol is purely advisory and relies on the cooperation of the web robot, so marking an area of your site out of bounds with robots.txt does not guarantee privacy. Many web site administrators have been caught out trying to use the robots file to make private parts of a website invisible to the rest of the world. However, the file is necessarily publicly available and is easily checked by anyone with a web browser.
The robots.txt patterns are matched by simple substring comparisons against the start of the URL path, so take care that patterns meant to match directories end with a final '/' character. Otherwise all files with names starting with that substring will match, rather than just those in the intended directory.
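The effect of the final '/' is easy to see with a small Python sketch. The helper is_disallowed and the sample paths are hypothetical and only mimic the simple comparison described above; real crawlers may apply additional rules.

```python
def is_disallowed(path, pattern):
    """Hypothetical helper: True if the URL path starts with the Disallow pattern."""
    return path.startswith(pattern)


paths = ["/duplicate/page.html", "/duplicate-info.html", "/about.html"]

# Without the final '/', the pattern also catches /duplicate-info.html.
print([p for p in paths if is_disallowed(p, "/duplicate")])

# With the final '/', only files inside the directory match.
print([p for p in paths if is_disallowed(p, "/duplicate/")])
```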
Meta Tag - Meta tags are used to provide structured data about a page, that is, data about data.
In the early 2000s, search engines veered away from reliance on Meta tags, as many web sites used inappropriate keywords or stuffed their pages with keywords to obtain any and all traffic possible.
Some search engines, however, still take Meta tags into some consideration when delivering results. In recent years, search engines have become smarter, penalizing websites that cheat (by repeating the same keyword several times to get a boost in search ranking). Instead of going up in the rankings, these websites will go down or, on some search engines, will be removed from the search engine completely.
Index a site - The act of crawling your site and gathering information.
How can the robots.txt file and meta tag help you?
In robots.txt you can tell harmful web crawlers to leave your web site alone, and give helpful hints to the ones you want to crawl your site. Here is an example of how to disallow a web crawler from crawling your site:
# this identifies the Wayback Machine
User-agent: ia_archiver
Disallow: /
ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The # lets you write comments to yourself so you can keep track of what you typed.
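One way to check what these three lines do is to feed them to the robots.txt parser in Python's standard library. This is a rough sketch, not a statement of how any particular crawler behaves: the example.com URL and the second crawler name, SomeOtherBot, are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The three-line example from above, as a list of lines.
ROBOTS_TXT = [
    "# this identifies wayback machine",
    "User-agent: ia_archiver",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

# ia_archiver is shut out of the whole site...
print(parser.can_fetch("ia_archiver", "http://www.example.com/index.html"))
# ...while a crawler with a different name (hypothetical here) is unaffected.
print(parser.can_fetch("SomeOtherBot", "http://www.example.com/index.html"))
```

The first call should report that fetching is not allowed and the second that it is, since the rule names only ia_archiver.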
Type the three robots.txt lines above into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this file first at a web site before doing anything else. This helps the crawler do its job, and helps the web site owner tell the spider what to do. Say, for instance, you have some data that you don't want crawlers to see (such as duplicate content for other browser referrer pages). You can deter crawlers from indexing the 'duplicate' directory by typing this into your robots.txt file.
Or, if you would like to have a robots.txt file created for you, visit www.rietta.com/robogen. To validate your robots.txt file and make sure it works properly, you can visit www.searchengineworld.com/cgi-bin/robotcheck.cgi
User-agent: *