Continued from page 1
ia_archiver is the crawler name for the Wayback Machine, which you may have heard of, and the / after Disallow tells ia_archiver not to index any of your site. The # lets you write comments to yourself so you can keep track of what you typed.
Type the above three lines into Notepad on your computer and save the file to the root directory of your web site as robots.txt. Well-behaved web crawlers look for this file first at a web site before doing anything else. It helps the crawler do its job, and it lets the web site owner tell the spider what to do. Say, for instance, you have some data that you don't want crawlers to see (such as duplicate content for other browser referrer pages).
You can deter crawlers from indexing the 'duplicate' directory by typing the rule shown below into your robots.txt file. If you would rather have a robots.txt file created for you, visit www.rietta.com/robogen. To validate your robots.txt file and make sure it works properly, you can visit www.searchengineworld.com/cgi-bin/robotcheck.cgi
User-agent: *
Disallow: /duplicate/
The * after User-agent says that this rule applies to all crawlers, and /duplicate/ after Disallow tells all crawlers to ignore this directory and not search it. Each User-agent and Disallow group must be separated from the next group by a blank line in order to function correctly. So this is how you would combine the above two commands in a robots.txt file:
# this identifies wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
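If you want to see how a crawler might read these rules before uploading the file, here is a minimal sketch using Python's standard urllib.robotparser module. The host www.example.com and the crawler name SomeBot are placeholders chosen for illustration, not anything from a real site, and real crawlers may interpret rules slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The combined robots.txt from the article, held in a string so that
# nothing has to be fetched over the network.
ROBOTS_TXT = """\
# this identifies wayback machine
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /duplicate/
"""


def build_parser(text):
    """Return a RobotFileParser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# www.example.com and "SomeBot" are assumed names used only for this demo.
checks = [
    ("ia_archiver", "http://www.example.com/index.html"),
    ("SomeBot", "http://www.example.com/index.html"),
    ("SomeBot", "http://www.example.com/duplicate/page.html"),
]

for agent, url in checks:
    verdict = "allowed" if parser.can_fetch(agent, url) else "blocked"
    print(f"{agent:12} {url} -> {verdict}")
```

With these rules, ia_archiver should be blocked everywhere, while any other crawler should be blocked only inside /duplicate/.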
One very important thing to note: anyone can access the robots.txt file of a site. So if you have information that you don't want anyone to see, don't list it in the robots.txt file. If the directory that you don't want anyone to see is not linked to from your web site, crawlers are unlikely to find and index it anyway.
An alternative way to block indexing is to put a meta tag into the page itself. It looks like this:
<meta name="robots" content="noindex, nofollow">
You put this into the <head> tag of your web page. This line tells robot crawlers not to index (search) the page and not to follow any of the hyperlinks on the page. So, as an example, <meta name="robots" content="noindex, follow"> tells robot crawlers not to index the page, but to follow the hyperlinks on it.
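To check which directives a page actually carries, here is a minimal sketch using Python's standard html.parser module. It assumes the conventional tag form <meta name="robots" content="...">, with the directives separated by commas. The class name RobotsMetaParser and the function robots_directives are made up for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="..."> tags aimed at one crawler."""

    def __init__(self, crawler="robots"):
        super().__init__()
        self.crawler = crawler.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != self.crawler:
            return
        for directive in attr_map.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html, crawler="robots"):
    """Return the set of directives a page gives to the named crawler."""
    parser = RobotsMetaParser(crawler)
    parser.feed(html)
    return parser.directives


# A small sample page, written only for this demo.
PAGE = """
<html>
  <head>
    <title>Sample page</title>
    <meta name="robots" content="noindex, follow">
  </head>
  <body><a href="other.html">Another page</a></body>
</html>
"""

found = robots_directives(PAGE)
print("index this page: ", "noindex" not in found)
print("follow its links:", "nofollow" not in found)
```

The crawler argument lets the same helper look for a tag addressed to one specific robot by name instead of the general "robots" tag.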
Did you know that Google has its own tag?
It looks like this: <meta name="googlebot" content="noindex, nofollow, noarchive"> This tells Google's robot crawler not to index the page, not to follow any of the links, and not to store cached versions of your web site. You will want the cache setting if you update the content on your site frequently. It prevents web users from seeing outdated content that hasn't been refreshed because it is stored in the cache. You can use this tag to talk specifically to Google's robots, either to avoid complications or because you are optimizing your site for Google's search engine. This concludes this month's article.
Until the next article, have a great day!
Copyright © Michael Rock Web development contractor (Web Design and Hosting) Internet Presence http://www.TheInternetPresence.com
The owner of this registered company has over twenty years' experience with DOS, Windows business applications, numerous programming languages, artistic development, and web design. Other areas of interest include web marketing, web promotion, and business marketing and development. After being persuaded by those praising his work, he decided to go into business for himself, and he highly recommends that everyone else do the same.