Calling out bad crawlers: the Kintiskton nuisance
Posted Feb 17th, 2009 by David Calhoun in Uncategorized
I have never been involved in creating a web crawler, but as a website owner I’m well aware of the behavior of good crawlers versus bad crawlers. For instance, a good crawler must not only follow the rules set by robots.txt, but it must also not impose an undue load on the server being indexed.
Famously, Cuil exhibited this bad behavior for at least several months before they claimed to have fixed it. In any case, I had to ban their IP range because they were just hitting my site too hard (compared to all of the other major crawlers out there).
Today I’m looking at my traffic stats for my WWI flight sim site and I see that yesterday I got over 200% new visitors. Strange thing was there was no major referring site, only direct hits! What on earth! So I check the logs and find that most of the IPs are from the range 65.208.151.112-65.208.151.119, which resolves to kintiskton-gw.customer.alter.net [63.114.61.170] before the tracert dies.
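The log check described above can be sketched in a few lines of Python. This is only a rough illustration: it assumes the access log uses the Common Log Format (client IP as the first field), and the sample lines and the names `SAMPLE_LOG`, `hits_per_ip` and `hits_in_range` are made up for the example rather than taken from the actual logs.

```python
import re
from collections import Counter
from ipaddress import ip_address

# Hypothetical sample lines in Common Log Format; a real log would be read from disk.
SAMPLE_LOG = """\
65.208.151.112 - - [16/Feb/2009:10:00:01 -0800] "GET / HTTP/1.1" 200 5120
65.208.151.113 - - [16/Feb/2009:10:00:01 -0800] "GET /images/a.jpg HTTP/1.1" 200 20480
65.208.151.112 - - [16/Feb/2009:10:00:02 -0800] "GET /about HTTP/1.1" 200 3072
192.0.2.44 - - [16/Feb/2009:10:05:10 -0800] "GET / HTTP/1.1" 200 5120
65.208.151.119 - - [16/Feb/2009:10:00:02 -0800] "GET /missing HTTP/1.1" 404 512
"""

# The range quoted in the post.
RANGE_START = ip_address("65.208.151.112")
RANGE_END = ip_address("65.208.151.119")


def hits_per_ip(log_text):
    """Count requests per client IP (the first field of each log line)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = re.match(r"(\d{1,3}(?:\.\d{1,3}){3})\s", line)
        if match:
            counts[match.group(1)] += 1
    return counts


def hits_in_range(counts, start, end):
    """Sum the hits whose IP falls inside the inclusive range [start, end]."""
    return sum(n for ip, n in counts.items() if start <= ip_address(ip) <= end)


counts = hits_per_ip(SAMPLE_LOG)
total = sum(counts.values())
suspect = hits_in_range(counts, RANGE_START, RANGE_END)

print(counts.most_common(3))
print(f"{suspect} of {total} hits came from the suspect range")
```

With a real log, the sample string would be replaced by the contents of the access log file; a lopsided share of hits from one small range is the signal to look for.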
Apparently this IP block is owned by Kintiskton LLC, whatever that is. When I do a Google search, I can’t find the actual company, only complaints about its crawler abusing people’s websites going back to December 2008 (several months).
The IP block is hosted by Verizon Business, so I shot over an email to abuse@verizon.net. After several months of Kintiskton’s excessive crawling, hopefully Verizon will eventually step up and look into it. Apparently they haven’t yet…
In the meantime, it’s good old Apache to the rescue.
I’ll be adding this to my .htaccess file:
Deny from 65.208.151.112
Deny from 65.208.151.113
Deny from 65.208.151.114
Deny from 65.208.151.115
Deny from 65.208.151.116
Deny from 65.208.151.117
Deny from 65.208.151.118
Deny from 65.208.151.119
I have the exact same problem. I just ban those IPs by default now on every site I work on. I guess that is all we can do to stop them. Thanks Verizon Business for being a nuisance.
I blocked this bot today. It’s annoying because it loads images and everything. And does it all at once, like you said.
It’s shorter to use the CIDR notation for the .htaccess file:
Deny from 65.208.151.112/29
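As a quick sanity check on that CIDR block, Python’s standard `ipaddress` module can confirm that the eight listed addresses collapse to exactly one /29. This is just a verification sketch, not part of the .htaccess setup itself.

```python
from ipaddress import ip_address, ip_network, summarize_address_range

first = ip_address("65.208.151.112")
last = ip_address("65.208.151.119")

# Collapse the inclusive range into the smallest list of CIDR blocks.
blocks = list(summarize_address_range(first, last))
print(blocks)

# Expand the /29 back out to confirm it is exactly the eight listed addresses.
addresses = [str(ip) for ip in ip_network("65.208.151.112/29")]
print(addresses)
```

The range summarizes to a single block because it starts on a multiple of 8 and contains exactly 8 addresses.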
[...] that made choose not to share my web content with Kintiskton. I first added the subnet to my .htaccess rules but decided to add it to my firewall rules. That took care of [...]
I am getting very similar activity from cuil.com.
Direct hits from 216.129.119.16.
Google Adsense shows the hits as page impressions for one day then they’re gone the next.
Hi. They came on to my site and I received over 600 hits from them in approx half an hour. I was able to exempt my statcounter.com from logging their IP address, but that doesn’t mean they can no longer access my site.
I use blogger. Is there a place in my Template where I can insert info that would deny them even loading my page? I am computer not-too-smart and so if you can help in kid language, that would be ideal
Cheers & thanks.
Just blocked them. Wondered what was making my server so busy.
They were crawling my site pretty heavily, but I was watching my site traffic in real time, so they only did a bit of crawling before I made an .htaccess file with:
deny from 65.208.151.
And it stopped them within seconds
I chose to deny the whole range (0-255) because no doubt they will use new IPs soon enough, but they should only change their server IPs (meaning only that last number will change).
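For a sense of the trade-off: a partial address such as 65.208.151. matches on the first three bytes, which is the same as blocking 65.208.151.0/24. A small Python sketch (the variable names `observed` and `blanket` are just for illustration) shows how much wider that is than the range reported in the post:

```python
from ipaddress import ip_network

observed = ip_network("65.208.151.112/29")  # range reported in the post
blanket = ip_network("65.208.151.0/24")     # what the partial address "65.208.151." covers

# Is the observed range fully inside the broader block?
inside = observed.subnet_of(blanket)
# How many additional addresses does the broader rule also block?
extra = blanket.num_addresses - observed.num_addresses

print(inside, extra)
```

The broader rule is more future-proof if the crawler moves within the same block, at the cost of also denying any unrelated visitors who happen to share it.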
If anyone wants to stop them crawling, simply make a text file, and write in it:
deny from 65.208.151.
and save it. Then upload the text file to your web hosting, rename it to .htaccess, and it’s done.
You can’t rename your text file to .htaccess in Windows; you have to do it in your FTP client once the file is on your web server, because Windows FORCES you to put something before the . in a file name. Linux/Unix web servers do not.
It even parses urls in JavaScript code without correctly executing the code itself and ended up hitting my site with 404s.