Creating a robots.txt File
Teacher: Sumantra Roy
Some people believe that they
should create different pages for different search engines,
each page optimized for one keyword and for one search engine.
Now, while I don't recommend that people create different pages
for different search engines, if you do decide to create such
pages, there is one issue that you need to be aware of.
These pages, although optimized
for different search engines, often turn out to be very similar
to each other. The search engines can now detect
when a site has created such similar-looking pages, and they
penalize or even ban such sites. To prevent
your site from being penalized for spamming, you need to stop
each search engine's spider from indexing pages which are not
meant for it, i.e. you need to prevent AltaVista from indexing
pages meant for Excite and vice versa. The best way to do that
is to use a robots.txt file.
You should create a robots.txt
file using a text editor like Windows Notepad. Don't use your
word processor to create such a file, because the file must be
saved as plain text.
Here is the basic syntax of the
robots.txt file:
User-Agent: [Spider Name]
Disallow: [File Name]
For instance, to tell AltaVista's
spider, Scooter, not to spider the file named myfile1.html
residing in the root directory of the server, you would write
User-Agent: Scooter
Disallow: /myfile1.html
To tell Excite's spider, called
ArchitextSpider, not to spider the files myfile2.html and myfile3.html,
you would write
User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
You can, of course, put multiple
User-Agent records in the same robots.txt file, separated by
a blank line. Hence, to
tell AltaVista not to spider the file named myfile1.html, and
to tell Excite not to spider the files myfile2.html and myfile3.html,
you would write
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
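If you want to see how a robot would read these records, here is a minimal sketch using the robots.txt parser that ships with Python (urllib.robotparser). It is only an illustration: the real Scooter and ArchitextSpider may interpret edge cases differently from this parser, and the helper name load_rules is my own.

```python
from urllib.robotparser import RobotFileParser

# The two-record robots.txt from the text, with a blank line between records.
ROBOTS_TXT = """\
User-Agent: Scooter
Disallow: /myfile1.html

User-Agent: ArchitextSpider
Disallow: /myfile2.html
Disallow: /myfile3.html
"""


def load_rules(text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Show, for each spider, which of the three files it may fetch.
for spider in ("Scooter", "ArchitextSpider"):
    for path in ("/myfile1.html", "/myfile2.html", "/myfile3.html"):
        verdict = "allowed" if rules.can_fetch(spider, path) else "blocked"
        print(f"{spider:16} {path:14} {verdict}")
```

Each spider is blocked only from the files listed in its own record; the other record has no effect on it.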
If you want to prevent all robots
from spidering the file named myfile4.html, you can use the
* wildcard character in the User-Agent line, i.e. you would
write
User-Agent: *
Disallow: /myfile4.html
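However, under the original robots.txt standard you cannot use the wildcard
character in the Disallow line. The value there is simply matched against
the beginning of the file's path.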
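The sketch below, again using Python's built-in urllib.robotparser, illustrates both points: the * record applies to every robot, and the Disallow value is compared against the start of the path. The spider name SomeOtherBot and the paths /myfile4.html.bak and /myfile5.html are made up for the example.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-Agent: *",
    "Disallow: /myfile4.html",
])

# "SomeOtherBot" is a made-up spider name; the * record applies to it too.
blocked_for_scooter = not parser.can_fetch("Scooter", "/myfile4.html")
blocked_for_other = not parser.can_fetch("SomeOtherBot", "/myfile4.html")

# The Disallow value is a prefix, so a longer (hypothetical) path that
# starts with it is blocked as well.
blocked_longer_path = not parser.can_fetch("Scooter", "/myfile4.html.bak")

# A file whose path does not start with the Disallow value stays fetchable.
other_file_allowed = parser.can_fetch("Scooter", "/myfile5.html")

print(blocked_for_scooter, blocked_for_other, blocked_longer_path, other_file_allowed)
```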
Once you have created the robots.txt
file, you should upload it to the root directory of your domain.
Uploading it to any sub-directory won't work - the robots.txt
file needs to be in the root directory.
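To make the "root directory" rule concrete, here is a small sketch that works out where a spider will look for the robots.txt file, given the address of any page on the site. The function name robots_url and the example.com address are assumptions made for the illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the URL where spiders expect robots.txt for the host of page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# prints http://www.example.com/robots.txt
print(robots_url("http://www.example.com/travel/guides/index.html"))
```

However deep the page sits in the site, the answer is the same file at the top level, which is why a copy in a sub-directory is never read.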
I won't discuss the syntax and
structure of the robots.txt file any further - you can get
the complete specifications from http://www.robotstxt.org/wc/norobots.html
Now we come to how the robots.txt
file can be used to prevent your site from being penalized
for spamming in case you are creating different pages for different
search engines. What you need to do is to prevent each search
engine from spidering pages which are not meant for it.
For simplicity, let's assume
that you are targeting only two keywords: "tourism in Australia" and "travel
to Australia". Also, let's assume that you are targeting only
four of the major search engines: AltaVista, Excite, HotBot and Northern
Light.
Now, suppose you have followed
the following convention for naming the files: each file name
consists of the words of the keyword for which the page is
optimized, separated by hyphens, followed by
the first two letters of the name of the search engine for
which the page is optimized.
Hence, the files for AltaVista are
tourism-in-australia-al.html
travel-to-australia-al.html
The files for Excite are
tourism-in-australia-ex.html
travel-to-australia-ex.html
The files for HotBot are
tourism-in-australia-ho.html
travel-to-australia-ho.html
The files for Northern Light are
tourism-in-australia-no.html
travel-to-australia-no.html
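The naming convention above can be sketched in a few lines of Python. The function name page_name is my own; lowercasing the keyword and the engine name is an assumption, made so that the output matches the file names listed above.

```python
def page_name(keyword, engine):
    """Join the keyword's words with hyphens and add the engine's first two letters."""
    slug = "-".join(keyword.lower().split())
    suffix = engine.lower()[:2]
    return f"{slug}-{suffix}.html"


KEYWORDS = ["tourism in Australia", "travel to Australia"]
ENGINES = ["AltaVista", "Excite", "HotBot", "Northern Light"]

# List the two pages created for each search engine.
for engine in ENGINES:
    print(engine, [page_name(keyword, engine) for keyword in KEYWORDS])
```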
As I noted earlier, AltaVista's
spider is called Scooter and Excite's spider is called ArchitextSpider.
A list of spiders for the major
search engines can be found at http://www.searchenginewatch.com/webmasters/spiderchart.html
From this list, we find that
the spider for Northern Light is called Gulliver. HotBot uses Inktomi and
Inktomi's spider is called Slurp. Using this knowledge, here's
what the robots.txt file should contain:
User-Agent: Scooter
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: ArchitextSpider
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: Slurp
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-no.html
Disallow: /travel-to-australia-no.html

User-Agent: Gulliver
Disallow: /tourism-in-australia-al.html
Disallow: /travel-to-australia-al.html
Disallow: /tourism-in-australia-ex.html
Disallow: /travel-to-australia-ex.html
Disallow: /tourism-in-australia-ho.html
Disallow: /travel-to-australia-ho.html
When you put the above lines
in the robots.txt file, you instruct each search engine not
to spider the files meant for the other search engines.
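With more keywords or more search engines, typing these records by hand quickly becomes error-prone. One possible way to generate the file instead is sketched below. It assumes the naming convention described earlier; the function build_robots_txt and the two tables it reads are illustrative names of my own, not part of any standard tool.

```python
# Spider name -> two-letter suffix of the pages meant for that engine (from the text).
SPIDERS = {
    "Scooter": "al",          # AltaVista
    "ArchitextSpider": "ex",  # Excite
    "Slurp": "ho",            # HotBot (Inktomi)
    "Gulliver": "no",         # Northern Light
}
KEYWORD_SLUGS = ["tourism-in-australia", "travel-to-australia"]


def build_robots_txt(spiders, slugs):
    """Return robots.txt text in which each spider is barred from the other engines' pages."""
    records = []
    for spider, own_suffix in spiders.items():
        lines = [f"User-Agent: {spider}"]
        for suffix in spiders.values():
            if suffix == own_suffix:
                continue  # a spider may fetch the pages meant for it
            for slug in slugs:
                lines.append(f"Disallow: /{slug}-{suffix}.html")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


robots_txt = build_robots_txt(SPIDERS, KEYWORD_SLUGS)
print(robots_txt)
```

Adding a keyword or a search engine then means adding one entry to a table rather than editing every record.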
When you have finished creating
the robots.txt file, double-check to ensure that you have not
made any errors anywhere in it. A small error can have disastrous
consequences: a search engine may spider files which are not
meant for it, in which case it can penalize your site for spamming,
or it may not spider any files at all, in which case you won't
get top rankings in that search engine.
A useful tool to check the syntax
of your robots.txt file can be found at http://www.tardis.ed.ac.uk/~sxw/robots/check/.
While it will help you correct syntactical errors in the robots.txt
file, it won't help you correct any logical errors, for which
you will still need to go through the robots.txt file thoroughly,
as mentioned above.
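A logical check of this kind can also be partly automated. The sketch below uses Python's built-in urllib.robotparser to confirm that each spider can fetch its own pages and nothing else. It is not the checker mentioned above, and the real spiders may read a file slightly differently from this parser. The function find_logic_errors and the deliberately faulty sample file are my own illustrations.

```python
from urllib.robotparser import RobotFileParser


def find_logic_errors(robots_lines, spiders, slugs):
    """Return (spider, page, problem) tuples; an empty list means the rules match the plan."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    errors = []
    for spider, own_suffix in spiders.items():
        for suffix in spiders.values():
            for slug in slugs:
                page = f"/{slug}-{suffix}.html"
                allowed = parser.can_fetch(spider, page)
                if suffix == own_suffix and not allowed:
                    errors.append((spider, page, "own page is blocked"))
                elif suffix != own_suffix and allowed:
                    errors.append((spider, page, "other engine's page is not blocked"))
    return errors


# Spider name -> suffix of the pages meant for it (two engines, to keep the example short).
SPIDERS = {"Scooter": "al", "ArchitextSpider": "ex"}
SLUGS = ["tourism-in-australia"]

# Deliberately faulty file: the Scooter record blocks its own -al page
# instead of the -ex page.
FAULTY = [
    "User-Agent: Scooter",
    "Disallow: /tourism-in-australia-al.html",
    "",
    "User-Agent: ArchitextSpider",
    "Disallow: /tourism-in-australia-al.html",
]

errors = find_logic_errors(FAULTY, SPIDERS, SLUGS)
for error in errors:
    print(error)
```

The faulty file is syntactically valid, so a syntax checker would accept it; only a check against what you intended reveals the two problems in the Scooter record.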
About the teacher:
Sumantra is one
of the most respected search engine positioning specialists on
the Internet. To have Sumantra's company place your site at the
top of the search engines, go to http://www.1stSearchRanking.com/. For
more advice on how you can take your web site to the top of the
search engines, subscribe to his FREE newsletter by going to http://www.1stSearchRanking.com/newsletter.htm