ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The # allows you to write comments to yourself so you can keep track of what you typed.
Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Web crawlers look for this document first at a web site before doing anything else. This helps the crawler do its job, and helps the web site owner tell the spider what to do. Say, for instance, you have some data that you don't want the crawlers to see (like duplicate content for other browser referrer pages).
You can deter crawlers from indexing the 'duplicate' directory by typing the two lines shown below into your robots.txt file. Or, if you would like to have a robots.txt file created for you, visit www.rietta.com/robogen. To validate your robots.txt file and make sure it works properly, you can visit www.searchengineworld.com/cgi-bin/robotcheck.cgi

User-agent: *
Disallow: /duplicate/

The * after User-agent says that this action applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not search it. Each group of User-agent and Disallow lines must be separated from the next group by a blank line in order for the file to function correctly. So this is how you would combine the above two commands into a robots.txt file:

# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/

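If you want to see how a well-behaved crawler would read the combined file, here is a minimal sketch using Python's standard urllib.robotparser module. The site address www.example.com and the crawler name SomeOtherBot are made up for illustration, and real crawlers may interpret rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt from the article: two rule groups
# separated by a blank line.
ROBOTS_TXT = """\
# this identifies the wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""

SITE = "http://www.example.com"  # hypothetical site, for illustration only


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# ia_archiver is shut out of the whole site.
print(parser.can_fetch("ia_archiver", SITE + "/index.html"))

# Any other crawler (SomeOtherBot is a made-up name) is only kept
# out of the /duplicate/ directory.
print(parser.can_fetch("SomeOtherBot", SITE + "/duplicate/page.html"))
print(parser.can_fetch("SomeOtherBot", SITE + "/index.html"))
```

Checking a file this way before uploading it is a quick complement to the online validator, not a replacement for it.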
One thing to note that is very important: anyone can access the robots.txt file of a site. So if you have information that you don't want anyone to see, don't include it in the robots.txt file. If the directory that you don't want anyone to see is not linked to from your web site, the crawlers won't index it anyway.

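To make the point concrete, the short sketch below shows how little effort it takes for a curious visitor to pull every Disallow path out of a robots.txt file. The function name disallowed_paths and the /private-reports/ directory are hypothetical, chosen only for this example.

```python
from typing import List


def disallowed_paths(robots_txt: str) -> List[str]:
    """Return every path named on a Disallow line, in order, without repeats."""
    paths: List[str] = []
    for raw_line in robots_txt.splitlines():
        # Drop comments (everything after #) and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            value = value.strip()
            if value and value not in paths:
                paths.append(value)
    return paths


# A made-up robots.txt that tries to hide a directory by listing it.
SAMPLE = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private-reports/   # hypothetical "secret" directory
"""

print(disallowed_paths(SAMPLE))
```

Listing a directory in robots.txt therefore advertises it to anyone who looks, which is why sensitive material should not be mentioned there at all.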
An alternative way to block indexing is to put a meta tag into the page itself. It looks like this:

<meta name="robots" content="noindex,nofollow">

You put this into the <head> tag of your web page. This line tells the robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. So as an example, <meta name="robots" content="noindex,follow"> tells the robot crawlers not to index the page, but to follow the hyperlinks on this page.

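As a rough illustration of how a crawler might act on this tag, the sketch below reads a page with Python's standard html.parser module and reports whether indexing and link following are allowed. It assumes the tag takes the common form <meta name="robots" content="...">, and the names RobotsMetaParser and robots_policy are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Directives are comma separated, e.g. "noindex,nofollow".
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_policy(html: str) -> dict:
    """Return whether the page may be indexed and its links followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return {
        "index": "noindex" not in parser.directives,
        "follow": "nofollow" not in parser.directives,
    }


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(robots_policy(page))
```

A page with no robots meta tag comes back as indexable and followable here, which matches the usual default behaviour of crawlers.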
Did you know that Google has its own tag? It looks like this:

<meta name="googlebot" content="noindex,nofollow,noarchive">

This tells the Google robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your web site. You will want this done if you update the content on your site frequently. This prevents the web user from seeing outdated content that isn't refreshed because of storage in the cache. You can use this tag to talk specifically to Google's robots, either to avoid complications or because you are optimizing your site for Google's search engine.

This concludes this month's article.
Until the next article, have a great day!

Copyright © Michael Rock Web development contractor (Web Design and Hosting) Internet Presence http://www.TheInternetPresence.com
