Let's suppose yours is a dynamic, database-driven site containing information about your newsletter subscribers and customers: their addresses, phone numbers, and so on. All this confidential information is kept in a separate directory called "admin". (It is recommended to keep such information in a separate directory. Handling the data will be easier for you, and so will keeping the search engines away. We will see how in a moment.) I am sure you would never want any unauthorized person to visit this area, let alone the search engines. It does not help the search engines either, since they have nothing to do with the data or files there. Here comes the role of the robots.txt file. Write the following in the robots.txt file. (Ignore the horizontal rules - they are included only to separate the commands from the rest of the text.)
--------------------------------------------------------------------------------
User-agent: *
Disallow: /admin/
--------------------------------------------------------------------------------
This does not allow the spiders to index anything in the admin directory, including its sub-directories, if any.
The asterisk (*) indicates all search engines. How do you stop a particular search engine from spidering your files or directories?
Suppose you want to stop Excite from spidering this directory:
--------------------------------------------------------------------------------
User-agent: ArchitextSpider
Disallow: /admin/
--------------------------------------------------------------------------------
Suppose you want to stop Excite and Google from spidering this directory:
--------------------------------------------------------------------------------
User-agent: ArchitextSpider
Disallow: /admin/

User-agent: Googlebot
Disallow: /admin/
--------------------------------------------------------------------------------
Files are no different. Suppose you do not want a file, datafile.html, to be spidered by Excite:
--------------------------------------------------------------------------------
User-agent: ArchitextSpider
Disallow: /datafile.html
--------------------------------------------------------------------------------
Similarly, suppose you do not want it to be spidered by Google either:
--------------------------------------------------------------------------------
User-agent: ArchitextSpider
Disallow: /datafile.html

User-agent: Googlebot
Disallow: /datafile.html
--------------------------------------------------------------------------------
Suppose you want two files datafile1.html and datafile2.html not to be spidered by Excite:
--------------------------------------------------------------------------------
User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html
--------------------------------------------------------------------------------
Can you guess what the following means?
--------------------------------------------------------------------------------
User-agent: ArchitextSpider
Disallow: /datafile1.html
Disallow: /datafile2.html

User-agent: Googlebot
Disallow: /datafile1.html
--------------------------------------------------------------------------------
Excite will spider neither datafile1.html nor datafile2.html. Google, on the other hand, will not spider only datafile1.html; it will still spider datafile2.html and the rest of the files in the directory.
Imagine you have a file in a sub-directory that you would not like to be spidered. What do you do? Let's suppose the sub-directory is "official" and the file is "confidential.html".
--------------------------------------------------------------------------------
User-agent: *
Disallow: /official/confidential.html
--------------------------------------------------------------------------------
I hope that's enough. A little practice is of course required. If the syntax of your robots.txt file is not written correctly, the search engines will ignore that particular command. Before uploading the robots.txt file, double-check it for any possible errors. You should upload the robots.txt file to the ROOT directory of your server. The search engines look for the robots.txt file only in the root directory; otherwise they ignore it entirely. Usually the root directory is the directory where the index page is kept. In that case, keep the robots.txt file in the same directory as the index file.
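To illustrate (using the same placeholder domain and "admin" directory from this article), only the first location below will be read; a robots.txt file placed in a sub-directory is simply ignored:
--------------------------------------------------------------------------------
http://www.your-domain.com/robots.txt           (root directory - the file is found and obeyed)
http://www.your-domain.com/admin/robots.txt     (sub-directory - the file is ignored)
--------------------------------------------------------------------------------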
I know a user-friendly piece of software that will write the robots commands for you (it was introduced at the beginning of this article). It can create an error-free robots.txt file very easily. This software, RoboGen, is a great tool. Never bother again to check the syntax of your robots.txt file, or even to write a robots.txt file yourself. RoboGen is a visual editor for Robot Exclusion Files and is easy to use. Just select the files you want to be visited or not visited by the search engines, and it creates the robots.txt file. You can also select the search engines of your choice. RoboGen maintains a database of over 180 search engine user-agents, selectable from a drop-down menu. It is the BEST and ONLY software on the Internet for writing a robots.txt file correctly and effectively. This great tool is cheaper than you expect. CLICK HERE NOW to know more!
Note: You should be able to see the robots.txt file if you type the following in the address bar of your Internet browser:
http://www.your-domain.com/robots.txt
(Where your-domain is the domain name of your website. If yours is not a .com site, replace .com with the respective extension of your website, e.g. .net, .us, .org, etc.)
You must be wondering whether to use the meta tag or robots.txt, and which of these is more effective!
A correctly written robots.txt file is more effective than the meta tag. All search engines support robots.txt, but not all search engines support the robots command written in meta tags. I recommend that you use both, so that your site is covered in both scenarios. RoboGen will help you to write both!
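For reference, the robots meta tag goes in the <head> section of each individual page you want to keep out of the index. A minimal example looks like this ("noindex" tells a compliant robot not to index the page, and "nofollow" tells it not to follow the page's links):
--------------------------------------------------------------------------------
<meta name="robots" content="noindex, nofollow">
--------------------------------------------------------------------------------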
One last thing: you can look in your web server log files to see which search engine robots have visited. They all leave signatures that can be detected. These signatures are nothing but the names of their robots, recorded in the user-agent field of each log entry. For instance, if Google has spidered your site, its visits will show up in your logs under the user-agent name Googlebot. This is how you know which search engine has spidered your pages, and when!
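As a purely hypothetical illustration (the IP address, date, and file name below are invented for this example), a Googlebot visit in a typical Apache-style access log might look something like this:
--------------------------------------------------------------------------------
66.249.66.1 - - [12/Mar/2004:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"
--------------------------------------------------------------------------------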
