
Search Engine Spiders And Your Robots.txt File

Oct 11, 2007
In this article we will discuss search engine spiders and what they do. You will also learn how to create a robots.txt file and why you might need one.

Search engine spiders are automated programs that crawl the Web looking for pages to feed to search engines. They are also called crawlers, robots and bots. Spiders are among the most useful programs on the Internet and a key part of how search engines operate: they allow your site to be found by the millions of people who use search engines. Feed the spiders right and they will tell the search engines about your site.

How Spiders Work

A search engine is an index to the Internet: it points you to relevant web sites depending on your search. To build that index, a search engine needs a tool that can visit websites, navigate them, decide what each site is about and add that data to the search engine's database.

Spiders are essentially programs that "crawl" sites and report their findings back to the search engine that sent them. Their purpose in life is to make it easy for your site to get listed in search engines.

Spiders work by finding links to web sites, visiting those sites, going through their content and then reporting what they find back to the database of the search engine they work for. From there, the information is added to the search engine's index, and the site shows up in search results.
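Real spiders are vastly more sophisticated, but the basic loop they run is simple: fetch a page, hand its content to the index, and follow its links. Here is a minimal sketch of that loop in Python, using only the standard library (example.com is a placeholder starting point, and the "indexing" step is just a print):

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """The core spider loop: visit a page, report it, follow its links."""
    queue = [start_url]
    seen = set()
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # page unreachable; a real spider would log and retry
        # "Report back": a real spider would feed the content to the index here
        print(f"Indexed {url} ({len(html)} bytes)")
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)

crawl("https://example.com/")  # placeholder starting point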

The robots.txt file

By defining a few rules, you can tell robots not to crawl certain directories or files within your site. Web sites do not absolutely have to have a robots.txt file; they can get along just fine without one. Most spiders look for a robots.txt file as soon as they arrive on your site. Take a look at your site statistics: if they include a "files not found" section, you may see many entries where spiders failed to find a robots.txt file on your site.

The default behavior is to allow everything unless you have a Disallow rule for that resource. If you wish to exclude some of your pages from search engine indexing, robots.txt is the tool approved by the search engines. Creating a robots.txt file that guides spiders is simple.

If you want to allow the spiders to crawl your site but exclude directories of your choice, copy and paste the following into a blank txt file:

User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/
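If you want to check what rules like these actually do, Python's standard library ships a robots.txt parser that evaluates them the same way a compliant spider would. A quick sketch, using the placeholder directories from the example above:

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /directory1/
Disallow: /directory2/
Disallow: /directory3/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Paths under a disallowed directory are blocked...
print(rp.can_fetch("*", "/directory1/page.html"))  # False

# ...but everything else is allowed by default
print(rp.can_fetch("*", "/about.html"))  # True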

To exclude files of your choice, type in the path to the files you want to exclude:

User-agent: *
Disallow: /directory1/page1.html
Disallow: /directory2/page2.html
Disallow: /directory3/page3.html

To exclude all the search engine spiders from your entire web site, copy and paste the following into the txt file:

User-agent: *
Disallow: /

This will keep a specific search engine spider from indexing your site (replace Name_of_Robot with the spider's actual name, such as Googlebot):

User-agent: Name_of_Robot
Disallow: /

To allow a single robot and exclude all other robots, use two records separated by a blank line. The first record gives the named robot the run of the site (an empty Disallow allows everything); the second blocks every other robot:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
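You can confirm the two records behave as intended with the same standard-library parser used above (SomeOtherBot is just a stand-in name for any other spider):

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("Googlebot", "/page.html"))     # True: the first record allows Googlebot everywhere
print(rp.can_fetch("SomeOtherBot", "/page.html"))  # False: the catch-all record blocks everyone else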

There can be only one robots.txt file on a site, and you may not have blank lines within a record (blank lines are what separate one record from the next). Once you have it the way you want, save the file as "robots.txt". Upload the file to the root directory of your site, that is, the directory where your home page or index page lives. Put the robots.txt file right alongside the index file.
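Once it is uploaded, the file should be reachable at yourdomain.com/robots.txt, which is the only place spiders look for it. A quick way to confirm from Python (example.com is a placeholder for your own domain):

from urllib.request import urlopen

# example.com is a placeholder; substitute your own domain
with urlopen("https://example.com/robots.txt") as response:
    print(response.status)                   # 200 means spiders can find the file
    print(response.read().decode("utf-8"))   # should echo your rules back to you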
About the Author
Sign up for the Web Success Weekly Email. Learn simple, step by step methods to get your business online and making money, the easy way: http://websuccess.info/

By Harvey Lew Robinson: http://websuccess.info/seo/spiders.html