Semalt Provides Tips On How To Deal With Bots, Spiders And Crawlers

Apart from creating search engine friendly URLs, the .htaccess file lets webmasters block specific bots from accessing their website. One way to block these robots is through the robots.txt file. However, Ross Barber, the Semalt Customer Success Manager, notes that some crawlers simply ignore this request. One of the best ways is to use the .htaccess file to stop them from crawling and indexing your content.
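
For well-behaved crawlers, robots.txt is the polite first step. As a minimal sketch, a robots.txt file like the one below, placed at your site's root, asks a bot (here a hypothetical "BadBot") to stay away entirely:

User-agent: BadBot
Disallow: /

Since malicious crawlers are free to ignore these directives, the .htaccess techniques described below are the stronger option.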

What are these bots?

They are a type of software used by search engines to collect new content from the internet for indexing purposes.

They perform the following tasks:

  • Visit web pages that you've linked to
  • Check your HTML code for errors
  • Save the web pages you link to and see what web pages link to your content
  • Index your content

However, some bots are malicious and scan your site for email addresses and contact forms, which are then used to send you unwanted messages or spam. Others even look for security loopholes in your code.

What is needed to block web crawlers?

Before using the .htaccess file, you need to check the following things:

1. Your site must be running on an Apache server. Nowadays, any web hosting company that is even half decent at its job gives you access to the required file.

2. You should have access to your website's raw server logs so that you can see which bots have been visiting your web pages.

Note that there is no way to block every harmful bot unless you block all bots, including those you consider helpful. New bots appear every day, and older ones get modified. The most efficient approach is to secure your code and make it hard for bots to spam you.

Identifying bots

Bots can be identified either by their IP address or by the "User-Agent" string they send in the HTTP headers. For instance, Google uses "Googlebot."
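
For reference, Googlebot's desktop crawler announces itself with a user agent string along these lines:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)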

You may find a list of 302 known bots helpful if you already have the name of the bot that you would like to keep away using .htaccess.

Another way is to download all the log files from the server and open them with a text editor. Their location on the server varies depending on your server's configuration. If you cannot find them, ask your web host for assistance.

If you know which page was visited, or the time of the visit, it's easier to identify an unwanted bot; you can search the log file using these parameters.
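
Each line in Apache's common "combined" log format records the client IP address, the timestamp, the request, and the user agent string, in that order. The values below are purely hypothetical:

197.0.0.1 - - [10/Jan/2024:13:55:36 +0000] "GET /contact HTTP/1.1" 200 2326 "-" "BadBot/2.1 (+http://www.example.com/bot.html)"

The first field gives you the IP address to block, and the quoted string at the end gives you the user agent.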

Once you've noted which bots you need to block, you can include them in the .htaccess file. Please note that blocking a bot isn't always enough to stop it; it may come back with a new IP or name.

How to block them

Download a copy of the .htaccess file and make a backup before editing it.

Method 1: Blocking by IP

The following snippet blocks a bot using the IP address 197.0.0.1:

Order Deny,Allow
Deny from 197.0.0.1

The first line tells the server to evaluate Deny rules before Allow rules, blocking requests that match the patterns you've specified and allowing all others.

The second line tells the server to answer any request from 197.0.0.1 with a 403: Forbidden page.
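
Order and Deny come from Apache 2.2 (kept alive in 2.4 by mod_access_compat). If your host runs Apache 2.4, a sketch of the equivalent rule, using the same example IP, looks like this:

<RequireAll>
Require all granted
Require not ip 197.0.0.1
</RequireAll>

Either way, you can stack several Deny from or Require not ip lines, or use a partial address such as 197.0.0. to block an entire range.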

Method 2: Blocking by user agent

The easiest way is to use Apache's rewrite engine:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BotUserAgent
RewriteRule . - [F,L]

The first line ensures that the rewrite module is enabled. The second line is the condition the rule applies to: any request whose user agent string matches BotUserAgent. In the third line, the "F" flag tells the server to return a 403: Forbidden, while "L" marks this as the last rule to apply.
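
To block several bots at once, chain conditions together with the OR flag, and add NC to make the match case-insensitive. The bot names below are hypothetical placeholders:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EvilScraper [NC]
RewriteRule . - [F,L]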

You will then upload the .htaccess file to your server, overwriting the existing one. Over time, you will need to update the IP addresses and user agent strings you block. If you make an error, just upload the backup that you made.