Working with the robots.txt file

Contents:
What is the robots.txt file?
Working with the robots.txt file
Advantages of the robots.txt file
Disadvantages of the robots.txt file
Optimization of the robots.txt file
Using the robots.txt file

What is the robots.txt file?
The robots.txt file is an ASCII text file that gives search engine robots instructions about content they are not allowed to index. These instructions are the deciding factor in how a search engine indexes your website's pages. The universal address of the robots.txt file is www.domain.com/robots.txt. This is the first file that a robot visits: it picks up the instructions for indexing the site content and follows them. The file contains two text fields. Let's study this robots.txt example:
User-agent: *
Disallow:
The User-agent field specifies the robot name for which the access policy follows in the Disallow field. The Disallow field specifies the URLs that the specified robots have no access to. A robots.txt example:
User-agent: *
Disallow: /
Here "*" means all robots and "/ " means all URLs. This is read as, " No access for any search engine to any URL" Since all URLs are preceded by "/ " so it bans access to all URLs when nothing follows after "/ ". If partial access has to be given, only
banned URL is specified in
Disallow field. Lets consider this robots.txt example :
# Research access for Googlebot.
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /concepts/new/
Here we see that both the fields have been repeated. Multiple commands can be given for different user agents on separate lines. The above commands mean that all user agents are banned from /concepts/new/, except Googlebot, which has full access. Characters following # are ignored up to the line termination, as they are considered to be comments.
Working with the robots.txt file : -
The robots.txt file is always named in all lowercase (e.g. Robots.txt or robots.Txt is incorrect).
Wildcards are not supported in either field. Only * can be used in the User-agent field's command syntax because it is a special character denoting "all". Googlebot is the only robot that now supports some wildcard file extensions. Ref: http://www.google.com/webmasters/faq.html#12
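As a rough sketch of what such an extended pattern can look like for Googlebot (the /*.pdf$ pattern is an illustrative assumption, not taken from the reference above), the following record asks Googlebot to skip URLs ending in .pdf, while other robots would not understand the wildcard:

User-agent: Googlebot
Disallow: /*.pdf$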
The robots.txt file is an exclusion file meant for search engine robot reference and is not obligatory for a website to function. An empty or absent file simply means that all robots are welcome to index any part of the website.
Only one robots.txt file can be maintained per domain.
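For instance, robots look only for www.domain.com/robots.txt at the root of the domain; a copy placed in a subdirectory (say, a hypothetical www.domain.com/folder/robots.txt) is simply not read.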
Website owners who do not have administrative rights sometimes cannot create a robots.txt file. In such situations, the Robots Meta Tag can be configured instead, which serves the same purpose. Keep in mind, however, that questions have lately been raised about robot behavior regarding the Robots Meta Tag, and some robots might skip it altogether. The protocol makes it obligatory for all robots to start with the robots.txt file, thereby making it the default starting point for all robots.
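As a minimal sketch, a Robots Meta Tag placed in the <head> of an individual page might look like this; the noindex, nofollow values ask compliant robots not to index the page or follow its links:

<meta name="robots" content="noindex, nofollow">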
Separate lines are required for specifying access for different user agents, and the Disallow field should not carry more than one command per line in the robots.txt file. There is no limit to the number of lines, though; i.e. both the User-agent and Disallow fields can be repeated with different commands any number of times. Blank lines will also not work within a single record set of both the commands.
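For example, to block several directories for all robots, each one gets its own Disallow line within the same record (the directory names here are hypothetical):

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/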
Use lower-case for all robots.txt file content. Please also note that filenames on Unix systems are case sensitive, so be careful about case when specifying directories or files for Unix-hosted domains. You can use this great tool from www.searchengineworld.com to check your robots.txt:
The robots.txt Validator
Please note that the full path to the robots.txt file (e.g. www.domain.com/robots.txt, not just the domain name) must be entered in the field.
Advantages of the robots.txt file : -
The protocol demands that all search engine robots start with the robots.txt file, so it is the default entry point for robots if the file is present. Specific instructions can be placed in this file to help index your site on the web. Major search engines will never violate the Standard for Robots Exclusion.
The robots.txt file can be used to keep out unwanted robots such as email retrievers, image strippers, etc.
The robots.txt file can be used to specify the directories on your server that you don't want robots to access and/or index, e.g. temporary, cgi, and private/back-end directories; an example follows this list.
An absent robots.txt file could generate a 404 error and redirect the robot to your default 404 error page. Careful research showed that sites without a robots.txt file but with a customized 404 error page would serve that page to the robots. The robot is bound to treat it as the robots.txt file, which can confuse its indexing.
The robots.txt file is used to direct select robots to the relevant pages to be indexed. This especially comes in handy where the site has multilingual content or where the robot is searching for only specific content.
The need for the robots.txt file was also felt to stop robots from deluging servers with rapid-fire requests or re-indexing the same files repeatedly.
If you have duplicate content on your site for any reason, the same can be kept from getting indexed. This will help you avoid any duplicate content penalties.
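As an illustration of some of the points above, the first record below bans every robot from a temporary directory and a duplicate print-friendly copy of the content, while the second shuts one unwanted robot out of the whole site (the directory names and the robot name are hypothetical examples, not taken from the text):

User-agent: *
Disallow: /tmp/
Disallow: /print-version/

User-agent: EmailSiphon
Disallow: /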
Disadvantages of the robots.txt file : -
Careless handling of directory and file names can lead hackers to snoop around your site by studying the robots.txt file, as you may sometimes list filenames and directories that hold classified content. This is not a serious issue, as deploying some effective security checks for the content in question can take care of it. For example, if you keep your traffic log on your site at a URL such as www.domain.com/stats/index.htm and you do not want robots to index it, you would have to add a command to your robots.txt file. As an example:
User-agent: *
Disallow: /stats/
However, it is easy for a snooper to guess what you are trying to hide, and simply typing the URL www.domain.com/stats into a browser would give access to the same. This calls for one of the following remedies -