How to Block Search Engine Robots

Robots.txt is a plain-text file that many web sites use to give specific instructions to search engine “robots” or “spiders”. The file typically tells them which pages or directories they shouldn’t crawl or index. It must be placed in the site’s root (main) directory, because that is where robots look for it.

The robots.txt file may be created in a basic text editor like Notepad or Edit. Be sure to save it in pure, text-only format. cPanel’s “File Manager” or FTP client software may be used to upload it. Each line is a separate instruction. Some sample instructions to include in robots.txt are as follows:

Disallow: /email/
Disallow: /contact.php
Disallow: /

The first example blocks the entire “email” directory (folder) from being accessed by search engine spiders, while the second disallows them from accessing the “Contact” page. The third example requests that they not access any files on the site. The initial slash refers to the site’s root directory, where the robots.txt file is located. Use paths relative to that root, not full URLs.
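
One way to check how a well-behaved robot would read rules like these is Python's standard urllib.robotparser module. The following is only a minimal sketch: the robot name ExampleBot, the host www.example.com, and the helper is_allowed are assumptions made for illustration, and a “User-agent: *” line is added because the parser needs one to attach the rules to.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, path, agent="ExampleBot"):
    """Return True if `agent` may fetch `path` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # can_fetch() only looks at the path part of the URL;
    # www.example.com is a placeholder host.
    return parser.can_fetch(agent, "http://www.example.com" + path)


# The first two sample instructions: block one directory and one page.
partial_block = "User-agent: *\nDisallow: /email/\nDisallow: /contact.php\n"

# The third sample instruction: block the whole site.
full_block = "User-agent: *\nDisallow: /\n"

print(is_allowed(partial_block, "/email/list.html"))  # False
print(is_allowed(partial_block, "/contact.php"))      # False
print(is_allowed(partial_block, "/index.html"))       # True
print(is_allowed(full_block, "/index.html"))          # False
```

Rules are matched as path prefixes, so “Disallow: /email/” covers every file below that directory.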

Most robots.txt files begin with a line reading “User-agent: *”. The purpose of this is to tell ALL search engine spiders that the instructions which follow apply to them. If the asterisk were replaced with the name of one engine’s robot, the instructions would apply only to that robot, and others would ignore them. If a robot requests the file and it doesn’t exist, a 404 error is logged on the server.
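
The effect of naming a robot on the “User-agent” line can be sketched with Python's standard urllib.robotparser module. This is an illustration under stated assumptions, not a definitive test: ExampleBot and www.example.com are hypothetical, and EmailSiphon is used only as an example of a named robot.

```python
from urllib.robotparser import RobotFileParser

# One group of rules for a single named robot, one for everybody else.
rules = """\
User-agent: EmailSiphon
Disallow: /

User-agent: *
Disallow: /email/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.example.com"  # placeholder host

# The named robot is shut out of the whole site.
print(parser.can_fetch("EmailSiphon", site + "/index.html"))  # False

# Any other robot (ExampleBot is hypothetical) falls under the "*" group.
print(parser.can_fetch("ExampleBot", site + "/index.html"))        # True
print(parser.can_fetch("ExampleBot", site + "/email/list.html"))   # False
```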

If there is a specific type of document which search engine spiders should be prohibited from indexing (such as PDF, DOC, or RTF), consider putting all of these files in the same folder and adding a “Disallow” statement that specifies this directory.

The overall purpose of using a robots.txt file is usually to control which pages visitors enter the web site through, reduce access to certain pages by “spammers”, and/or limit the amount of bandwidth (data transfer) being consumed by search engine spiders as they read from the site. Keep in mind that compliance is voluntary, so a badly behaved robot may simply ignore the file.

The robots META tag can be used for much of the same purpose as the robots.txt file, but it is not applicable to non-HTML pages like text files, PDFs, images, and so on. If a web site operator wants all files to be indexed by search engines, there is no real purpose in having a robots.txt file.
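
For reference, a robots META tag usually takes a form such as <meta name="robots" content="noindex, nofollow"> inside the page's head section. The sketch below, which assumes Python's standard html.parser module and a made-up sample page, shows how such a tag could be read from an HTML document; the class name RobotsMetaParser is hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, nofollow".
            self.directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]


# A made-up page used only for illustration.
page = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body></body></html>"
)

meta_parser = RobotsMetaParser()
meta_parser.feed(page)
print(meta_parser.directives)  # ['noindex', 'nofollow']
```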

Here is an example of a robots.txt file that we have created:

User-agent: Slurp
Crawl-delay: 20

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

The first entry in the robots.txt file sets a crawl delay for Yahoo Slurp. It was hitting our site pretty hard and really slowing down the client’s server. While we do not recommend slowing down a robot that is coming to your site, there are always exceptions to the rule.
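
A file like the one above can be sanity-checked before it is uploaded. The following is a rough sketch using Python's standard urllib.robotparser module (its crawl_delay method needs Python 3.6 or later). It uses a shortened copy of the file; ExampleBot and www.example.com are hypothetical names added for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the sample file above.
sample = """\
User-agent: Slurp
Crawl-delay: 20

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /
"""

checker = RobotFileParser()
checker.parse(sample.splitlines())

home = "http://www.example.com/index.html"  # placeholder URL

# Slurp is slowed down but not blocked.
print(checker.crawl_delay("Slurp"))      # 20
print(checker.can_fetch("Slurp", home))  # True

# The listed robots are blocked from the whole site.
print(checker.can_fetch("WebBandit", home))  # False

# A robot that is not listed (ExampleBot is hypothetical) has no restrictions.
print(checker.crawl_delay("ExampleBot"))      # None
print(checker.can_fetch("ExampleBot", home))  # True
```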
