Working with robots.txt file
What is robots.txt file?
Working with robots.txt file
Advantages of robots.txt
Disadvantages of robots.txt file
Optimization of robots.txt file
Using robots.txt file
What is robots.txt file?
The robots.txt file is an ASCII text file containing instructions that tell search engine robots which content they are not allowed to index. These instructions are a deciding factor in how a search engine indexes your website's pages. The universal address of the robots.txt file is www.domain.com/robots.txt. This is the first file that a robot visits; it picks up the instructions for indexing the site's content and follows them. The file uses two text fields. Let's study this robots.txt example:
User-agent: *
Disallow:
The User-agent field names the robot to which the access policy in the Disallow field applies. The Disallow field specifies the URLs that the named robots may not access. When the Disallow field is left empty, as in the example above, nothing is off limits. Another robots.txt example:
User-agent: *
Disallow: /
Here "*" means all robots and "/" means all URLs. This is read as "No access for any search engine to any URL". Since every URL path begins with "/", a Disallow value of "/" with nothing after it bans access to all URLs. If partial access has to be given, only the banned URL is specified in the Disallow field. Let's consider this robots.txt example:
# Research access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/
Here both fields are repeated. Separate commands can be given for different user agents on different lines. The commands above mean that all user agents are banned from /concepts/new/ except Googlebot, which has full access. Characters following # are ignored up to the end of the line, as they are treated as comments.
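One quick way to check how a standards-following robot would read the example above is to feed the same rules to the urllib.robotparser module from Python's standard library. This is only a sketch: the robot name SomeOtherBot and the two page URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, one field per line.
RULES = """\
# Research access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical URLs on the example domain.
blocked_page = "http://www.domain.com/concepts/new/index.htm"
open_page = "http://www.domain.com/index.htm"

# Googlebot has its own record with an empty Disallow field.
googlebot_in_banned_dir = parser.can_fetch("Googlebot", blocked_page)
# Any other robot falls under the "*" record.
other_in_banned_dir = parser.can_fetch("SomeOtherBot", blocked_page)
other_elsewhere = parser.can_fetch("SomeOtherBot", open_page)

print(googlebot_in_banned_dir, other_in_banned_dir, other_elsewhere)
```

Under these assumptions, Googlebot may fetch the page inside /concepts/new/, while any other robot is refused there but may still fetch pages outside that directory.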
Working with robots.txt file : -
The robots.txt file name is always written in all lowercase (Robots.txt and robots.Txt, for example, are incorrect).
Wildcards are not supported in either field. Only * can be used, and only in the User-agent field, because it is a special character denoting "all". Googlebot is the only robot that now supports some wildcard file extensions. Ref: http://www.google.com/webmasters/faq.html#12
The robots.txt file is an exclusion file meant for search engine robots to consult; a website does not need it in order to function. An empty or absent file simply means that all robots are welcome to index any part of the website.
Only one robots.txt file can be maintained per domain.
Website owners who do not have administrative rights sometimes cannot create a robots.txt file. In such situations, the Robots Meta Tag can be configured to serve the same purpose. Keep in mind that questions have lately been raised about robot behavior regarding the Robots Meta Tag, and some robots might skip it altogether. The protocol, by contrast, obliges all robots to start with robots.txt, which makes it the default starting point for every robot.
Each user agent must be specified on its own line, and a Disallow field must not carry more than one command per line. There is no limit to the number of lines, though: both the User-agent and Disallow fields can be repeated with different commands any number of times. Blank lines are not allowed within a single record (a User-agent line together with its Disallow lines).
Use lower-case for all robots.txt file content. Please also note that filenames on Unix systems are case sensitive, so be careful about case when naming directories or files for Unix-hosted domains. You can use this great tool from www.searchengineworld.com to check your robots.txt:
The robots.txt Validator
Please note that the full path to the robots.txt file must be entered in the field.
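The formatting rules listed above (one Disallow command per line, no blank lines inside a record, any number of records) can be sketched as a small generator. The function name build_robots_txt and its input format are assumptions made for this illustration, not part of any standard tool.

```python
def build_robots_txt(policies):
    """Render {user_agent: [disallowed paths]} as robots.txt text.

    Hypothetical helper: an empty list means the robot has full
    access, which is written as a bare "Disallow:" field.
    """
    records = []
    for user_agent, paths in policies.items():
        lines = [f"User-agent: {user_agent}"]
        # One Disallow command per line, never several on one line.
        for path in paths or [""]:
            lines.append(f"Disallow: {path}".rstrip())
        # Lines of one record are joined with no blank lines between them.
        records.append("\n".join(lines))
    # A blank line separates one record from the next.
    return "\n\n".join(records) + "\n"


example = build_robots_txt({
    "Googlebot": [],
    "*": ["/concepts/new/", "/stats/"],
})
print(example)
```

The output keeps each record as an unbroken block, so a robot reading it line by line can tell where one user agent's commands end and the next begin.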
Advantages of robots.txt file : -
The protocol requires all search engine robots to start with the robots.txt file. It is the default entry point for robots if the file is present. Specific instructions can be placed in this file to help index your site on the web. Major search engines will never violate the Standard for Robots Exclusion.
The robots.txt file can be used to keep out unwanted robots such as email retrievers and image strippers.
The robots.txt file can be used to specify directories on your server that you don't want robots to access and/or index, e.g. temporary, cgi, and private/back-end directories.
An absent robots.txt file could generate a 404 error and redirect the robot to your default 404 error page. Careful research showed that sites with no robots.txt file and a customized 404 error page would serve that page to robots. The robot is bound to treat it as the robots.txt file, which can confuse its indexing.
The robots.txt file is used to direct select robots to the relevant pages to be indexed. This is especially handy where a site has multilingual content or where a robot is searching for only specific content.
The need for a robots.txt file was also felt to stop robots from deluging servers with rapid-fire requests or re-indexing the same files repeatedly.
If you have duplicate content on your site for any reason, you can keep it from getting indexed. This will help you avoid any duplicate content penalties.
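To guard against the custom 404 page problem described above, a site owner could check what the server actually returns for /robots.txt. The heuristic below is only a sketch under stated assumptions: the function name looks_like_robots_txt and its three checks are invented for this illustration, and real robots may apply different rules.

```python
def looks_like_robots_txt(status_code, content_type, body):
    """Heuristic: does this HTTP response look like a real robots.txt?

    Hypothetical check, not taken from any robot's actual logic.
    """
    # A missing file should come back as an error status, not 200.
    if status_code != 200:
        return False
    # robots.txt is plain text; an HTML content type suggests an error page.
    if "html" in content_type.lower():
        return False
    # A custom error page served as plain text still starts with markup.
    head = body.lstrip().lower()
    if head.startswith("<!doctype") or head.startswith("<html"):
        return False
    return True


# Made-up responses for illustration.
real_file = looks_like_robots_txt(200, "text/plain", "User-agent: *\nDisallow:\n")
error_page = looks_like_robots_txt(200, "text/html; charset=utf-8", "<html>Not found</html>")
print(real_file, error_page)
```

The status code, content type and body would come from whatever HTTP client is used to request www.domain.com/robots.txt; they are passed in as plain values here so the check itself stays easy to test.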
Disadvantages of robots.txt file : -
Careless handling of directory and file names can lead hackers to snoop around your site by studying the robots.txt file, since you may sometimes list file names and directories that hold classified content. This is not a serious issue, as deploying effective security checks on the content in question can take care of it. For example, if you keep your traffic log on your site at a URL such as www.domain.com/stats/index.htm and do not want robots to index it, you would have to add a command to your robots.txt file. As an example:
User-agent: *
Disallow: /stats/
However, it is easy for a snooper to guess what you are trying to hide: simply typing the URL www.domain.com/stats into a browser would give access to it. This calls for one of the following remedies -