[EP-tech] Re: Announcing eprints2archives
Hi Michael,
thank you for this initiative.
In what way is your application a replacement for the harvesting by
archive.org?
We observe the archive.org bot visiting our repo in waves, sometimes
harvesting more than one million pages per month. The bot does not respect
robots.txt (which in a default EPrints installation would block /cgi/ to
bots) for various reasons (see
https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
), so it also harvests data in all the various export plugin formats. We
are not sure whether this is a good idea, because a website owner will have
good reasons to protect certain parts of their site. But that is how it is
with archive.org.
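As a rough illustration of the robots.txt rule mentioned above, here is a
minimal Python sketch of how a well-behaved bot would evaluate it. The
robots.txt content, the host repo.example.org, the bot name and the export
path are assumptions made up for the example, not taken from a real EPrints
installation.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: a single rule blocking /cgi/ for all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one under /cgi/ (an export), one abstract page.
export_url = "https://repo.example.org/cgi/export/123.bib"
abstract_url = "https://repo.example.org/id/eprint/123"

print(is_allowed(ROBOTS_TXT, "examplebot", export_url))
print(is_allowed(ROBOTS_TXT, "examplebot", abstract_url))
```

A crawler that skips this check, as described above, fetches both URLs.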
From another perspective, we think that offering browse views (/view/*) is
outdated (it corresponds to the web of the 90s). It puts strain on the
server (the job creating the views for our 400K author list took more than
1.5 days, and the pages filled GBs of disk space) without much use for the
end user (who drills through lists of either 10K publications per year or
15K authors per letter of the alphabet?). It is of limited use for bots,
which just get x variants of a path to the same boring eprint and so
generate unnecessary traffic that has to be filtered out for statistics,
and it creates a high potential for attacks by badly behaved bots. Offering
a good sitemap.xml for bots, replacing lists with lookup (we did so for the
authors), and faceted search provide a much improved experience.
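To make the sitemap.xml suggestion concrete, a minimal sketch in Python
follows. It assumes eprint pages are reachable as <base>/id/eprint/<id>; the
base URL and the function name build_sitemap are hypothetical, and a real
sitemap would probably also carry <lastmod> entries and be split into
several files for a large repository.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url: str, eprint_ids) -> str:
    """Return a sitemap document with one <url> entry per eprint ID."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for eprint_id in eprint_ids:
        url = ET.SubElement(urlset, "url")
        loc = ET.SubElement(url, "loc")
        # Assumed URL pattern for an eprint's landing page.
        loc.text = f"{base_url.rstrip('/')}/id/eprint/{eprint_id}"
    return ET.tostring(urlset, encoding="unicode")


sitemap = build_sitemap("https://repo.example.org/", [1, 2, 3])
print(sitemap)
```

Each eprint then appears exactly once, instead of once per browse view.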
Kind regards,
Martin
From: "Michael Hucka via Eprints-tech" <eprints-tech at ecs.soton.ac.uk>
To: eprints-tech at ecs.soton.ac.uk
Date: 03/09/2020 20:37
Subject: [EP-tech] Announcing eprints2archives
Sent by: <eprints-tech-bounces at ecs.soton.ac.uk>
Greetings,
eprints2archives is a new program to archive the web pages of an EPrints
server in public web archiving sites such as the Internet Archive
(https://archive.org/web/). It contacts an EPrints server, obtains the
list of documents it serves (optionally filtered based on such things as
modification date), determines the document URLs, extracts additional
URLs by scraping pages under the "/view" section of the public site, and
finally, sends the collected URLs to web archives. Use-cases include
archiving server content ahead of migration to another system, and
preserving contents in independent third-party archives.
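The steps described above can be sketched roughly as follows. This is not
the eprints2archives code: the record list, the /id/eprint/<id> URL pattern
and both function names are assumptions for illustration. The only external
detail used is the Wayback Machine "Save Page Now" address form,
https://web.archive.org/save/<url>. No network request is made here.

```python
from datetime import date


def select_urls(records, base_url, modified_since=None):
    """Filter (eprint_id, last_modified) records and build page URLs."""
    urls = []
    for eprint_id, last_modified in records:
        # Optional filter on modification date.
        if modified_since is not None and last_modified < modified_since:
            continue
        urls.append(f"{base_url.rstrip('/')}/id/eprint/{eprint_id}")
    return urls


def wayback_save_requests(urls):
    """Return the Save Page Now request URL for each page URL."""
    return ["https://web.archive.org/save/" + url for url in urls]


# Hypothetical records, as they might be obtained from the server.
records = [(1, date(2019, 5, 1)), (2, date(2020, 8, 30))]
urls = select_urls(records, "https://repo.example.org",
                   modified_since=date(2020, 1, 1))
print(wayback_save_requests(urls))
```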
The program is written in Python 3 and works over a network using an
EPrints server's REST API and normal HTTP. eprints2archives can work
with EPrints servers that require logins as well as those that allow
anonymous access. It uses parallel threads by default, transparently
handles rate limits, and robustly deals with network errors. Currently,
it can send contents to the Internet Archive and Archive.Today; more
destination archives may be added in the future.
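One common way to combine parallel threads with rate-limit and network-error
handling is retry with exponential backoff inside a thread pool. The sketch
below shows that general technique only; it is an assumption about how such
behaviour can be built, not a description of how eprints2archives does it.
RateLimited, send_with_retry and send_all are hypothetical names, and send
stands for any function that submits one URL to an archive.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Hypothetical signal that the archive asked us to slow down."""


def send_with_retry(send, url, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call send(url), backing off exponentially on transient failures."""
    for attempt in range(max_tries):
        try:
            return send(url)
        except (RateLimited, ConnectionError):
            if attempt == max_tries - 1:
                raise  # give up after the last attempt
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay


def send_all(send, urls, workers=4, **retry_options):
    """Send all URLs using a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda url: send_with_retry(send, url, **retry_options), urls))
```

Passing sleep as a parameter keeps the retry logic testable without real
waiting.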
You can install eprints2archives from PyPI or GitHub. For more
information, please visit
https://github.com/caltechlibrary/eprints2archives
Please report problems using the issue tracking system, which you can
find at the GitHub link above.
Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka at caltech.edu --
http://www.cds.caltech.edu/~mhucka
California Institute of Technology
*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/