Posts tagged: robots

Get vs. Post Method – Search Engine Robots

Clients often ask us whether search engines can spider forms or access the information behind them. The answer is “yes”, they can, but you need to make sure you are using the correct “method”. Forms usually use one of two methods, which determine how the information on the page is submitted to the server so that results can be delivered back to the web user. These two methods are “get” and “post”. The “get” method sends the data as part of the URL. When the form data (or “query data”) is added to the end of the URL it is “URL encoded” so that the data can be used in a standard URL. The “post” method sends the data in the body of the request instead, and it is the preferred method for sending lengthy form data. When a form is submitted with POST, the user does not see the form data that was sent.
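To make the difference concrete, here is a minimal Python sketch that builds (but does not send) a GET request and a POST request from the same form data. The form fields and the example.com address are made up for illustration; they are not taken from any real site.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields and endpoint, used only for illustration.
form_data = {"q": "blue widgets", "page": "2"}
endpoint = "http://www.example.com/search"

# URL-encode the form data so it can travel safely in a URL or request body.
encoded = urlencode(form_data)

# GET: the encoded data is appended to the URL as a query string,
# so every distinct search has its own address.
get_request = Request(endpoint + "?" + encoded, method="GET")

# POST: the same encoded data goes in the request body; the URL stays the same.
post_request = Request(endpoint, data=encoded.encode("ascii"), method="POST")

print(get_request.full_url)   # http://www.example.com/search?q=blue+widgets&page=2
print(post_request.full_url)  # http://www.example.com/search
print(post_request.data)      # b'q=blue+widgets&page=2'
```

Because the GET version has a distinct URL for each set of form values, a robot that finds a link to that URL can request the results page directly. The POST version only has the bare form address.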

As for which method lets search engines spider the information behind the form, the Wikipedia excerpts below explain that it is the “get” method:

Get vs. Post Method

Get
Requests a representation of the specified resource. Note: GET should not be used for operations that cause side-effects, such as using it for taking actions in web applications. One reason for this is that GET may be used arbitrarily by robots or crawlers, which should not need to consider the side effects that a request should cause. See safe methods below.

Post
Submits data to be processed (e.g., from an HTML form) to the identified resource. The data is included in the body of the request. This may result in the creation of a new resource or the updates of existing resources or both.

Safe Methods
Some methods (for example, HEAD, GET, OPTIONS and TRACE) are defined as safe, which means they are intended only for information retrieval and should not change the state of the server. In other words, they should not have side effects, beyond relatively harmless effects such as logging, caching, the serving of banner advertisements or incrementing a web counter. Making arbitrary GET requests without regard to the context of the application’s state should therefore be considered safe.

By contrast, methods such as POST, PUT and DELETE are intended for actions which may cause side effects either on the server, or external side effects such as financial transactions or transmission of email. Such methods are therefore not usually used by conforming web robots or web crawlers, which tend to make requests without regard to context or consequences.

Despite the prescribed safety of GET requests, in practice their handling by the server is not technically limited in any way, and careless or deliberate programming can just as easily (or more easily, due to lack of user agent precautions) cause non-trivial changes on the server. This is discouraged, because it can cause problems for Web caching, search engines and other automated agents, which can make unintended changes on the server.

How to Block Search Engine Robots

Robots.txt is a text-based file used by many web sites to give specific instructions to search engine “robots” or “spiders”. The file typically tells them which pages or directories they shouldn’t crawl and index. It must be located in the root (main) directory of the site, because that is where robots look for it.

The robots.txt file may be created in a basic text editor like Notepad or Edit. Be sure to save it in pure, text-only format. cPanel’s “File Manager” or FTP Client software may be used to upload it. Each line is a separate instruction. Some sample instructions to include in robots.txt are as follows:

Disallow: /email/
Disallow: /contact.php
Disallow: /

The first example blocks the entire “email” directory (folder) from being accessed by search engine spiders, while the second disallows them from indexing the “Contact” page. The third example requests that they not index any files on the site. The initial slash refers to the root directory of the site, which is where the robots.txt file lives. Do not use full URLs.

Most robots.txt files begin with a line reading “User-agent: *”. The purpose of this is to tell ALL search engine spiders that it applies to them. If the asterisk were replaced with the name of one engine’s robot, instructions would only apply to it and others would ignore them. A 404 error is logged if a robot tries to access the file and it doesn’t exist.
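If you want to check how a well-behaved robot would read rules like these, Python’s standard urllib.robotparser module can parse them. The sketch below is only an illustration: the robot name “ExampleBot”, the example.com address, and the test paths are made up, and real search engine robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# A small robots.txt built from the sample instructions above.
rules = """\
User-agent: *
Disallow: /email/
Disallow: /contact.php
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical robot name; the "*" block applies to it.
site = "http://www.example.com"
results = {
    path: parser.can_fetch("ExampleBot", site + path)
    for path in ("/email/list.html", "/contact.php", "/index.html")
}

for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

Only /index.html comes back as allowed, since the other two paths fall under the Disallow lines.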

If there is a specific type of document which search engine spiders should be prohibited from indexing (such as PDF, DOC, or RTF), consider putting all of these files in the same folder and adding a “Disallow” statement that specifies this directory.

The overall purpose of using a robots.txt file is usually to control which pages visitors enter the web site through, reduce access to certain pages by “spammers”, and/or limit the amount of bandwidth (data transfer) being consumed by search engine spiders as they read from the site.

The robots META tag can be used for much of the same purpose as the robots.txt file, but it is not applicable to non-HTML pages like text files, PDFs, images, and so on. If a web site operator wants all files to be indexed by search engines, there is no real purpose in having a robots.txt file.

Here is an example of a robots.txt file that we have created:

User-agent: Slurp
crawl-delay: 20

User-agent: URL_Spider_Pro
Disallow: /

User-agent: CherryPicker
Disallow: /

User-agent: EmailCollector
Disallow: /

User-agent: EmailSiphon
Disallow: /

User-agent: WebBandit
Disallow: /

The “crawl-delay” entry at the top of the robots.txt file is for Yahoo Slurp. Yahoo’s robot was hitting the site pretty hard and was really slowing down the client’s server. While we do not generally recommend slowing down a robot that is coming to your site, there are always exceptions to the rule.
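As a rough check of how per-robot blocks like these are read, the sketch below feeds a shortened copy of the file above to Python’s standard urllib.robotparser module. The example.com address and the “Googlebot” query are only illustrations, and the blocked robots listed in the file may simply ignore robots.txt in practice.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the robots.txt file shown above.
sample = """\
User-agent: Slurp
crawl-delay: 20

User-agent: EmailSiphon
Disallow: /
"""

parser = RobotFileParser()
parser.parse(sample.splitlines())

page = "http://www.example.com/index.html"  # hypothetical URL

# Slurp is only asked to slow down; it may still fetch pages.
slurp_delay = parser.crawl_delay("Slurp")
slurp_allowed = parser.can_fetch("Slurp", page)

# EmailSiphon is asked to stay away from the whole site.
siphon_allowed = parser.can_fetch("EmailSiphon", page)

# A robot with no block of its own (and no "*" block) is unrestricted.
other_allowed = parser.can_fetch("Googlebot", page)
other_delay = parser.crawl_delay("Googlebot")

print(slurp_delay, slurp_allowed, siphon_allowed, other_allowed, other_delay)
```

Note that this file has no “User-agent: *” block, so any robot not named in it is free to crawl everything.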

How to Write META Tags Properly

META tags are incorporated into the HTML code of many webpages. The most common are TITLE, DESCRIPTION, and KEYWORDS (strictly speaking, TITLE is its own HTML element rather than a META tag, but it is commonly grouped with them). They affect not only how a site appears in search listings, but also how high or low a position it receives in results. Read on to learn how to write META tags properly and effectively…

TITLE TAG

Perhaps the most important META tag, the TITLE tag determines the clickable title of a search result, as well as the text which appears on the title bar of a web browser. Here are some tips on how to properly write this tag:

  • Use fewer than sixty-five characters; otherwise, part of the title will be cut off by browsers and search engines.
  • Every page on the website needs to have a unique title.
  • Your primary keyword target should be at the beginning of the TITLE tag.
  • Place your company name at the end of the TITLE tag.

DESCRIPTION TAG

The second most important META tag, DESCRIPTION, determines what text appears below the title in search results (with some exceptions). People who visit a webpage will not see this tag unless they look at the source code. Tips on how to use it properly:

  • Keep the description approximately 125-175 characters.
  • Don’t put specific data searchers are looking for in the description tag; they might only read it and not visit the site.
  • Use the description to market to the user and increase the click through to the website.
  • The description is not a ranking factor, but be sure to use your target keyword(s) and keyword phrases so that they will appear in bold in the search results.

META KEYWORD TAG

A less important META tag is KEYWORDS, as many search engines do not use it as a ranking factor. The META KEYWORDS tag contains one or more search keywords that relate to the web page in question. Here are a few tips:

  • Avoid using more than fifteen keywords, and don’t write any word more than once.
  • Use words which aren’t in the TITLE or DESCRIPTION tags but that are in the copy of the page.
  • Avoid words that aren’t relevant to the page’s subject.  In fact, using keywords not relevant or included in your page can lead to ranking penalties.
  • To properly separate multiple keywords, use commas.
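The numeric tips above are easy to check automatically. Here is a minimal Python sketch that does so; the function name check_meta and the sample values are our own invention, and the limits are simply the ones suggested in this post, not rules published by any search engine.

```python
def check_meta(title, description, keywords):
    """Return a list of warnings based on the tips in this post."""
    warnings = []

    # TITLE: use fewer than sixty-five characters.
    if len(title) >= 65:
        warnings.append("title has 65 or more characters")

    # DESCRIPTION: keep it at approximately 125-175 characters.
    if not 125 <= len(description) <= 175:
        warnings.append("description is outside 125-175 characters")

    # KEYWORDS: comma-separated, at most fifteen, none repeated.
    words = [k.strip().lower() for k in keywords.split(",") if k.strip()]
    if len(words) > 15:
        warnings.append("more than 15 keywords")
    if len(set(words)) != len(words):
        warnings.append("a keyword is repeated")

    return warnings


# Hypothetical page values, for illustration only.
print(check_meta(
    title="Blue Widgets and Widget Repair | Example Co",
    description="Short description.",
    keywords="blue widgets, widget repair, blue widgets",
))
```

For the sample values, the function warns about the short description and the repeated keyword, while the title passes.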

OTHER META TAGS

Some lesser-known META tags are not used by most search engines and browsers, making them minimally useful. However, one other fairly important META tag is ROBOTS. It gives specific instructions to search engine “robots” or “spiders”: automated computer programs which visit and index websites, recording information that will appear in search results.

You may not need to know HTML to write META tags properly. Programs like FrontPage and Dreamweaver allow users to set the META tags for a webpage without editing the code by hand. For example, the function for setting META tags in FrontPage is located in “Page properties”, under the “Custom” tab. The TITLE is set under the “General” tab.

Segment Your Web Site for SEO

A major part of search engine optimization is making sure that all of the pages on your web site are easily accessible by both humans and search engines. Since each page on a web site can be indexed, you can also optimize each page for searchers and search engines.

To optimize your site for search engines, consider that they use search engine robots to find and index a site. These robots, also called spiders, continually look for content on the web that needs to be indexed. Once they find something, they follow the hyperlinks to each web page. This is called “crawling” a site. When the robot arrives on a page, it reads through the content and adds it to the index. Because this is how your site’s pages get added to search engine results, it is important to have a navigational site structure that is friendly to the robots.

Additionally, the search engines will only rank pages that are perceived as important. That’s why it is necessary to create a content hierarchy in your navigational structure: your most important pages should be at the top of your site structure. Whichever page is at the top, usually your home page, generally attracts the most links. Search engine robots often stop searching after three clicks from the homepage, so decide on a hierarchy that keeps your important pages within that distance.
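To see which pages sit more than three clicks from the homepage, you can measure click depth with a breadth-first search over your internal links. The sketch below is a minimal illustration: the site_links dictionary is a made-up site map, and the three-click threshold is just the rule of thumb mentioned above.

```python
from collections import deque


def click_depths(links, home):
    """Return the minimum number of clicks from `home` to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:  # first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


# Hypothetical internal link structure: page -> pages it links to.
site_links = {
    "/": ["/products/", "/about/"],
    "/products/": ["/products/widgets/"],
    "/products/widgets/": ["/products/widgets/blue/"],
    "/products/widgets/blue/": ["/products/widgets/blue/specs/"],
}

depths = click_depths(site_links, "/")
too_deep = sorted(page for page, depth in depths.items() if depth > 3)
print(too_deep)  # ['/products/widgets/blue/specs/']
```

A page that shows up in too_deep is a candidate for a link from a category page or from the homepage itself.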

This leads to categorization. If you want to organize your content in a natural way, you should create categories for all of your site’s content. Then link those categories to your homepage. This creates more key phrases to link to, which can help you attract a wider audience for searches.

Keep in mind that search engine robots only follow HTML links. This means that any links using Flash, JavaScript, dropdown menus or submit buttons don’t get picked up by them, and the pages behind those links don’t get indexed. Besides that, HTML links are a better choice because the anchor text can describe the destination page for human visitors to see.
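The sketch below shows what a simple crawler of the kind described here might “see” on a page. It uses Python’s standard html.parser module and collects only the href values of ordinary anchor tags. The class name LinkCollector and the sample markup are made up for illustration, and real search engine robots are more sophisticated than this.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from plain <a> tags, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page fragment: one HTML link, one JavaScript link, one form.
page = """
<a href="/products/">Products</a>
<span onclick="location.href='/hidden/'">Hidden page</span>
<form action="/search"><input type="submit" value="Go"></form>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)  # ['/products/']
```

Only the plain HTML link is found; the JavaScript navigation and the submit button are invisible to this kind of crawler.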

Finally, create a sitemap. A sitemap basically works like an index, listing links to all the pages on your web site. If you link a sitemap to your home page, robots have easy access to all your web site’s pages. Be aware, though, that robots generally follow fewer than 100 links from one page. If you have more pages than that, consider creating a multi-page sitemap.
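Splitting a long list of pages into a multi-page sitemap is simple to automate. Here is a minimal Python sketch; the function name paginate_sitemap and the generated page names are hypothetical, and the limit of 99 links per page just reflects the “fewer than 100” rule of thumb above.

```python
def paginate_sitemap(urls, max_links=99):
    """Split a list of URLs into sitemap pages of at most `max_links` each."""
    return [urls[i:i + max_links] for i in range(0, len(urls), max_links)]


# Hypothetical site with 250 pages.
urls = ["/page-%d.html" % n for n in range(1, 251)]

pages = paginate_sitemap(urls)
print(len(pages), [len(page) for page in pages])  # 3 [99, 99, 52]
```

Each sub-list can then be rendered as its own sitemap page, with the sitemap pages linked to one another so robots can reach all of them.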

Launching a New SEO-Friendly Web Site

Launching a new web site with SEO (Search Engine Optimization) in mind is less time consuming than waiting until it has been completed to begin SEO efforts. Putting an emphasis on this from the beginning will prevent the web site owner or designer from wasting time on the creation of pages which aren’t search engine friendly. Here are some tips on launching a new site with SEO in mind.

1. Web site owners who create their own sites should learn about META tags and how to achieve the best keyword density. It is also helpful to understand how search engine “spiders” (or “robots”) work, including the types of content they can and cannot read.

2. If you have someone else design your new web site, find a web designer with SEO experience who is willing to optimize the site in the process of designing it. Some designers have little understanding of SEO and create sites which are very unfavorable in this manner.

3. When initially launching and promoting the web site, be careful not to use advertising methods which are frowned upon by search engines. Such methods include posting to FFA (free-for-all) link pages or buying links without the “nofollow” attribute.

4. If the new website needs to have text-based content produced for it before launching, it is best if a writer who is skilled in SEO creates this material. This type of content may include informational articles, blog entries, press releases, or product descriptions.

5. Although it may seem more visually appealing or easier to create, designers should avoid putting paragraphs of text inside images. This is detrimental to SEO in most situations, because search engine spiders can’t read this type of text. Navigation systems should also be search engine friendly, and it is a good idea to create a site map before launching the new website.

6. Some types of SEO oriented promotional techniques can draw attention to a new web site as it is launched, while also improving its search engine ranking. These techniques include submitting the site to article directories, posting it on social bookmarking systems, or using it in a forum signature.

Keeping SEO in mind while launching a new website provides some incidental benefits as well. Sites which are search engine friendly are often less difficult to access for people who have less common web browsers or are visually impaired. Sites designed with SEO in mind also tend to have easier navigation and are better-organized.

Block Robots Using .htaccess

Most malicious robots don’t honor the exclusions you request in robots.txt, but you can find lots of examples across the internet that block them using mod_rewrite. Here is one of the best we could find, from:

Robot Article 

# User-Agents with no privileges (mostly spambots/spybots/offline downloaders that ignore robots.txt)
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$" [OR] # Cyveillance spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR] # NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^64\.140\.49\.6([6-9])$ [OR] # Turnitin spybot
RewriteCond %{HTTP_REFERER} iaea\.org [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} anarchie [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} Atomz [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} cherry.?picker [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "compatible ; MSIE 6.0" [OR] # spambot (note extra space before semicolon)
RewriteCond %{HTTP_USER_AGENT} crescent [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^DA \d\.\d+" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "DTS Agent" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} "^Download" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} EasyDL/\d\.\d+ [OR] # OD
RewriteCond %{HTTP_USER_AGENT} e?mail.?(collector|magnet|reaper|siphon|sweeper|harvest|collect|wolf) [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} express [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} extractor [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Fetch API Request" [OR] # OD
RewriteCond %{HTTP_USER_AGENT} flashget [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} FlickBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} FrontPage [OR] # stupid user trying to edit my site
RewriteCond %{HTTP_USER_AGENT} getright [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} go.?zilla [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "efp@gmx\.net" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} grabber [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} imagefetch [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} httrack [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} "Indy Library" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "^Internet Explore" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ^IE\ \d\.\d\ Compatible.*Browser$ [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "LINKS ARoMATIZED" [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "Microsoft URL Control" [OR] # spambot
RewriteCond %{HTTP_USER_AGENT} "mister pix" [NC,OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/4.0$" [OR] # dumb bot
RewriteCond %{HTTP_USER_AGENT} "^Mozilla/\?\?$" [OR] # formmail attacker
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [OR] # IE’s “make available offline” mode
RewriteCond %{HTTP_USER_AGENT} ^NG [OR] # unknown bot
RewriteCond %{HTTP_USER_AGENT} offline [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} net.?(ants|mechanic|spider|vampire|zip) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} nicerspro [NC,OR] # spambot
RewriteCond %{HTTP_USER_AGENT} ninja [NC,OR] # Download Ninja OD
RewriteCond %{HTTP_USER_AGENT} NPBot [OR] # NameProtect spybot
RewriteCond %{HTTP_USER_AGENT} PersonaPilot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} snagger [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} Sqworm [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} SurveyBot [OR] # rude bot
RewriteCond %{HTTP_USER_AGENT} tele(port|soft) [NC,OR] # OD
RewriteCond %{HTTP_USER_AGENT} TurnitinBot [OR] # Turnitin spybot
RewriteCond %{HTTP_USER_AGENT} vayala [OR] # dumb bot, doesn't know how to follow links, generates lots of 404s
RewriteCond %{HTTP_USER_AGENT} zeus [NC]
RewriteRule .* - [F,L]
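
Before deploying rules like these, it can help to try a few of the user-agent patterns against sample strings so that you don’t block a robot you actually want. The Python sketch below is only a rough check: it copies four patterns from the list above, treats the [NC] flag as a case-insensitive match, and uses made-up user-agent strings. Apache uses its own regular expression engine, so the behavior on the server may differ for more unusual patterns.

```python
import re

# (pattern, has [NC] flag) pairs copied from the rules above.
RULES = [
    (r"^[A-Z]+$", False),
    (r"cherry.?picker", True),
    (r"e?mail.?(collector|magnet|reaper|siphon|sweeper|harvest|collect|wolf)", True),
    (r"httrack", True),
]

# [NC] ("no case") is approximated here with re.IGNORECASE.
COMPILED = [
    re.compile(pattern, re.IGNORECASE if nocase else 0)
    for pattern, nocase in RULES
]


def is_blocked(user_agent):
    """Return True if any pattern matches anywhere in the user-agent string."""
    return any(regex.search(user_agent) for regex in COMPILED)


# Hypothetical user-agent strings, for illustration only.
samples = [
    "EmailSiphon",
    "HTTrack 3.0x",
    "Mozilla/5.0 (compatible; Googlebot/2.1)",
]
for user_agent in samples:
    print(user_agent, "blocked" if is_blocked(user_agent) else "allowed")
```

With these samples, the first two strings are blocked and the Googlebot-style string is allowed.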