
[EP-tech] Re: How to modify robots.txt and add a new bot?



Seb,

Yes (as normal) you're right.  I removed the Crawl-delay directive that ahrefs referenced, but I got their attention another way.  :-)

And as normal - thanks for the help.

-Brian.

Brian D. Gregg
Solutions Architect | Manager Systems Development
University of Pittsburgh | University Library System
Address: 7500 Thomas Blvd.  Room 129 Pittsburgh, PA 15208<https://maps.google.com/maps?q=7500+Thomas+Blvd,+Pittsburgh,+PA&hl=en&sll=41.117935,-77.604698&sspn=7.662465,13.73291&oq=7500+Tho&t=h&hnear=7500+Thomas+Blvd,+Pittsburgh,+Pennsylvania+15208&z=17>
Tel: (412) 648-3264 | Email: bdgregg at pitt.edu<mailto:bdgregg at pitt.edu> | Fax: (412) 648-3585

From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of sf2
Sent: Friday, December 19, 2014 12:09 PM
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Re: How to modify robots.txt and add a new bot?


Brian, if I were you, I'd remove RobotsTxt.pm (it's ref'ed in Apache/Rewrite.pm, I think) and just have a static "robots.txt" file served by EPrints as per any other static file. That will give you direct control over its content. RobotsTxt.pm is not useful at all imo.

Then EPrints doesn't do anything special for robots. You can try Crawl-delay (but I don't think that's a standard directive, so it might not be followed by other crawlers). The last thing you can do is black-list it :-)
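For reference, a static robots.txt along these lines might look like the following. The AhrefsBot-specific rules are only an illustration of the approach Seb describes, not a tested recipe:

```
# Static robots.txt served in place of RobotsTxt.pm output.
# Per-bot stanza: throttle AhrefsBot and keep it out of /cgi/.
User-agent: AhrefsBot
Crawl-delay: 2
Disallow: /cgi/

# Default stanza for all other crawlers.
User-agent: *
Disallow: /cgi/
```

Note that Crawl-delay is a non-standard directive: ahrefs documents it, but many crawlers (including Googlebot) ignore it.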

Seb
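The black-list option could also be handled at the Apache level rather than in robots.txt, which is more reliable for a bot that ignores Disallow rules. A minimal sketch using Apache 2.2-era access control (the exact config file location depends on your Apache/EPrints layout):

```apache
# Hypothetical sketch: tag requests whose User-Agent contains "AhrefsBot"
# (case-insensitive) and deny them access to the whole site.
BrowserMatchNoCase "AhrefsBot" bad_bot

<Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Location>
```

Unlike a robots.txt directive, this does not rely on the bot cooperating; it simply returns 403 for matching requests.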

On 19.12.2014 17:00, Brian D. Gregg wrote:
As a follow up, I've found that the perl_lib/robots.pm that I found is related to AWSTATS, so that isn't going to help here.  So please ignore that bit of info.

-Brian.


From: eprints-tech-bounces at ecs.soton.ac.uk<mailto:eprints-tech-bounces at ecs.soton.ac.uk> [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Brian D. Gregg
Sent: Friday, December 19, 2014 11:40 AM
To: eprints-tech at ecs.soton.ac.uk<mailto:eprints-tech at ecs.soton.ac.uk>
Subject: [EP-tech] How to modify robots.txt and add a new bot?

All,

I've noticed that we are getting crawled by what seems to be a newer robot, "AhrefsBot" (http://ahrefs.com). It also seems to be ignoring the "Disallow: /cgi/" stanza, as the logs and the Apache server-status show it hitting things in /cgi.

As a first measure to rein this bot in, I'd like to add a parameter to the default robots.txt file, "Crawl-Delay: 2", per their documentation (https://ahrefs.com/robot/), but I'm not finding a simple way of doing this in EPrints. So I started going through the files and ran across perl_lib/EPrints/Apache/RobotsTxt.pm, where I see the default definition for the robots.txt file.  I've updated that file and restarted the web server, but alas the robots.txt file does not change.
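For reference, the stanza the ahrefs documentation describes would look roughly like this in robots.txt (using the Crawl-Delay value cited above):

```
User-agent: AhrefsBot
Crawl-delay: 2
```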

So two questions:

1. Does anyone have a hint on what needs to be done to identify a new bot correctly?  I've also found perl_lib/robots.pm but am not sure where to add the AhrefsBot to the file.

2. Does anyone know how to update the robots.txt file?  Is it per archive?

Thanks,
Brian Gregg.





*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech

*** Archive: http://www.eprints.org/tech.php/

*** EPrints community wiki: http://wiki.eprints.org/

*** EPrints developers Forum: http://forum.eprints.org/


