LSTSRV-L Archives

LISTSERV Site Administrators' Forum

Andy Smith-Petersen <[log in to unmask]>
Tue, 29 May 2007 09:44:39 -0400
Googlebot was over-crawling us recently - hundreds of thousands of hits
daily, despite a relatively small number of public archives on our site.
From looking at the web server logs, it was clear that they were stuck
in some kind of loop, indexing some posts between two and six times a
day. I did get Google support to decrease the crawl rate a bit, but they
did not fix the loop. (My suspicion is that the large number of URL
parameters for plaintext vs. HTML, fixed-width vs. variable fonts, etc.,
was confusing the issue, but I never did nail that down.)

So I added a bunch of entries to our robots.txt file to disallow
crawling of any archives more than one month old.
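
For anyone who wants to generate entries like these rather than type them
by hand, here is a minimal sketch in Python. It is not the actual set of
rules used on our site. It assumes the archive CGI lives at /cgi-bin/wa
and that the month appears at the start of the query string as
A1=indYYMM (index pages) or A2=indYYMM (individual posts); both the path
and that URL layout are assumptions, so check them against your own web
server logs first. The function name and its parameters are made up for
illustration. It relies only on the standard prefix matching that
robots.txt Disallow lines use.

```python
from datetime import date


def old_archive_rules(today, first_year, first_month, cgi_path="/cgi-bin/wa"):
    """Build robots.txt lines that block monthly archives older than last month.

    cgi_path and the A1=/A2= indYYMM URL layout are assumptions; adjust
    them to match what your web server logs actually show.
    """
    # Count months as a single integer so year boundaries need no special case.
    first = first_year * 12 + (first_month - 1)
    current = today.year * 12 + (today.month - 1)
    # Leave the current month and the previous month crawlable.
    cutoff = current - 1

    lines = ["User-agent: *"]
    for idx in range(first, cutoff):
        year, month0 = divmod(idx, 12)
        tag = f"ind{year % 100:02d}{month0 + 1:02d}"
        # A1= is assumed to be the monthly index, A2= the individual posts.
        for param in ("A1", "A2"):
            lines.append(f"Disallow: {cgi_path}?{param}={tag}")
    return lines


if __name__ == "__main__":
    # Example: archives start in January 2007, run on the date of this post.
    print("\n".join(old_archive_rules(date(2007, 5, 29), 2007, 1)))
```

Because Disallow matches on URL prefixes, each rule only works if the
A1/A2 parameter really is the first thing after the question mark. If your
links put other parameters first, these rules will not match and you would
need a different approach, such as blocking at the web server.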

-- 
Andy Smith-Petersen
System Administrator
IT Network Services
University of Southern Maine


On Sun, 2007-05-27 at 19:22 -0500, Andrew Bosch wrote:
> It's probably a Googlebot invoking the WA CGI. You will have better
>  success blocking access at your web server or firewall.
> 
> 
> >>> <[log in to unmask]> 5/27/2007 7:12 PM >>>
> Since at least midnight today, every few seconds we're seeing
> 
> 27 May 2007 20:03:16 To   [ANONYMOUS]@LISTSERV.SYR.EDU: ***LOGIN***
> 27 May 2007 20:03:17 From [ANONYMOUS]@LISTSERV.SYR.EDU: X-LOGCK
> 14BF5837AF8379B229 AUTHINFO(66.249.67.57) ORGINFO(66.249.67.57)
> 
> 66.249.67.57 is registered to Google.
> 
> Any ideas what is going on? Any techniques I can use to shut this off?
> I tried adding a filter *@66.249.67.57, but that doesn't seem to stop
> it.
> 
> Thanks,
> Nelson
> 
> -- Syracuse University Listserv List Manager
> -- Listserv webpage: http://listserv.syr.edu
> 
