The love and hate of bots. They index, spam, scrape, hack, probe, and who knows what else people program them to do. They eat up resources, which can greatly reduce the performance of your website. We need the good ones to index our site and bring us visitors, so how do we separate the bad ones from the good?
All robots are supposed to obey the robots.txt file, which indicates which areas of our website we want bots to crawl. We set up a directory in the robots.txt file that we don't want robots to crawl. Then we add a hidden link in the footer of our site that points to this directory. The link is hidden so that a human visitor will never go there. If a bot enters this directory, we add its IP to a blacklist. Then, in our header, we link to a script which checks this blacklist and stops any blacklisted bots from loading our site.
Read over Jeff’s blackhole script first. I’ve added to Jeff’s implementation for a couple of reasons:
- a competitor could potentially use a hidden image on their site that links to your blackhole, which would ban any of their visitors from visiting your site
- Google may or may not penalize hidden links. It’s debatable, so I thought I’d stay away from them
- I want to give the bot a second chance before banning it
Update Robots File
The first thing you want to do is update your robots.txt to include your robot trap directory.
User-Agent: *
Disallow: /robotsfunhouse/*
Leave this in place for about a week before you do anything else, so any robots crawling your site have time to fetch your updated robots file.
The Hidden Link
After a week, add this to your footer code or at the bottom of your site:
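The original footer link code didn’t survive here, so this is a sketch of what it might look like. Since this write-up prefers to avoid hidden links (see the reasons above), the link is kept visible but marked nofollow; the path and link text are assumptions, not the author’s exact code.

```html
<!-- footer link into the trap directory; path and text are illustrative -->
<a href="/robotsfunhouse/" rel="nofollow">Do not follow this link</a>
```

Because robots.txt already disallows /robotsfunhouse/, any bot that follows this link has ignored the first warning.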
Now that the bot has ignored our first warning and gone into the /robotsfunhouse/ directory, we give it its final strike.
Forbidden Page
This page is forbidden. If you follow this link, you will be banned from this site. If you came here by mistake, please return Home.
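The markup for this trap page isn’t shown here, so the following is a sketch of how it might look. The file name, the exact meta tags, and the link target are assumptions; the ban link’s hash query string is generated server-side from a secret word and the visitor’s own IP.

```html
<!-- /robotsfunhouse/index page — a sketch, not the author's exact code -->
<head>
  <title>Forbidden Page</title>
  <!-- good bots must not index, follow, or cache this page -->
  <meta name="robots" content="noindex, nofollow, noarchive">
</head>
<body>
  <h1>Forbidden Page</h1>
  <p>This page is forbidden. If you follow this link, you will be banned
     from this site. If you came here by mistake, please return
     <a href="/">Home</a>.</p>
  <!-- the hash parameter is computed server-side from a secret word
       plus the visitor's IP address -->
  <a href="process.php?hash=...">Continue</a>
</body>
```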
Here we have an index page which shouldn’t be followed by a bot and is marked nofollow, noindex, noarchive. No well-behaved bot should index this page or follow any links on it. We add to the link a hash of a secret word and the visitor’s IP address. This hash is verified on the process.php page so that no one can link directly to the trap and get other people banned.
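The verification step on process.php can be sketched like this. The md5-of-secret-plus-IP scheme, the function names, and the blackhole.dat format (one IP per line) are assumptions based on the description above, not the author’s exact implementation.

```php
<?php
// process.php — sketch of the hash check and ban (names are illustrative)

// Hash scheme assumed: md5 of a secret word concatenated with the IP.
function trap_hash($secret, $ip) {
    return md5($secret . $ip);
}

// True only when the submitted hash matches this visitor's own IP, so a
// third party cannot craft a link that bans someone else.
function hash_is_valid($secret, $ip, $given) {
    return hash_equals(trap_hash($secret, $ip), (string) $given);
}

// Append the offending IP to the blacklist file (blackhole.dat).
function ban_ip($file, $ip) {
    file_put_contents($file, $ip . "\n", FILE_APPEND | LOCK_EX);
}
```

process.php would then compare the hash from the query string against $_SERVER['REMOTE_ADDR'], and only call ban_ip() when it matches.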
Banning the Bots
Now wait another week and see if you get any bad bots in your blackhole.dat file. Once you’re satisfied with how it’s performing, turn it on by adding a link to blackhole.php in your header.
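The header check in blackhole.php can be sketched as follows. The one-IP-per-line blackhole.dat format is assumed from the article; the rest of the logic is illustrative rather than the author’s exact code.

```php
<?php
// blackhole.php — sketch of the header check (illustrative, not exact)

// Returns true when the visitor's IP appears in the blacklist file.
function is_banned($file, $ip) {
    if (!is_readable($file)) {
        return false; // no blacklist yet, let everyone through
    }
    $banned = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    return in_array($ip, $banned, true);
}

// Refuse to serve the rest of the page to a blacklisted bot.
if (is_banned(__DIR__ . '/blackhole.dat', $_SERVER['REMOTE_ADDR'] ?? '')) {
    header('HTTP/1.1 403 Forbidden');
    exit('Forbidden');
}
```

Because this runs before any page content is sent, a banned bot never gets past the header.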
Now sit back and enjoy any time you get a Bad Bots email, knowing that bot is no longer crawling your site.