EPrints Technical Mailing List Archive

Message: #08312



[EP-tech] Re: Announcing eprints2archives


Hi Michael,

Thank you for this initiative.

In what way is your application a replacement for harvesting by archive.org?

We observe the bot@archive.org bot visiting our repo in waves, sometimes harvesting more than one million pages per month. The bot does not respect robots.txt (which in a default EPrints installation would block /cgi/ for bots) for various reasons (see https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ ), so it also harvests data in all the various export plugin formats. We are not sure whether this is a good idea, because a website owner will have good reasons to protect certain parts of their site. But that is how it is with archive.org.
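
To illustrate the robots.txt point, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt content, the example paths and the is_allowed helper are assumptions made for illustration, not a copy of any real EPrints configuration. It only shows what a well-behaved bot would conclude from a rule that blocks /cgi/; a bot that ignores robots.txt never performs this check.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, assumed to resemble a rule that blocks /cgi/ for all bots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
"""


def is_allowed(url_path, user_agent="*"):
    """Return True if a robots.txt-respecting bot may fetch url_path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url_path)


if __name__ == "__main__":
    # Example paths are made up; an export plugin URL would live under /cgi/.
    for path in ("/cgi/export/eprint/1/BibTeX/example.bib", "/1/"):
        print(path, is_allowed(path))
```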

From another perspective, we think that offering browse views (/view/*) is outdated; it corresponds to the web of the 1990s. Browse views put strain on the server (the job that creates the views for our list of 400K authors took more than 1.5 days, and the pages filled gigabytes of disk space) without much benefit for the end user: who drills through lists of 10K publications per year or 15K authors per letter of the alphabet? They are also of limited use for bots, which just get x different routes to the same boring eprint and so generate unnecessary traffic that has to be filtered out of the statistics, and they create a high potential for attacks by badly behaved bots. Offering a good sitemap.xml for bots, replacing lists with lookup (we did so for the authors), and offering faceted search provide a much better experience.
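
As a rough illustration of the sitemap.xml idea, the sketch below builds a sitemap from a list of eprint URLs with Python's standard xml.etree.ElementTree. The build_sitemap function, the example host and the dates are hypothetical; a real repository would take them from its own database. The element names and namespace follow the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (url, lastmod) pairs, lastmod as YYYY-MM-DD."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Made-up example entries: one URL per eprint, no browse-view variants.
    print(build_sitemap([
        ("https://repo.example.org/1/", "2020-09-01"),
        ("https://repo.example.org/2/", "2020-09-03"),
    ]))
```

Each eprint then appears exactly once, so a bot has no reason to walk several browse-view paths to reach the same record.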

Kind regards,

Martin



From: "Michael Hucka via Eprints-tech" <eprints-tech@ecs.soton.ac.uk>
To: eprints-tech@ecs.soton.ac.uk
Date: 03/09/2020 20:37
Subject: [EP-tech] Announcing eprints2archives
Sent by: <eprints-tech-bounces@ecs.soton.ac.uk>





Greetings,

eprints2archives is a new program to archive the web pages of an EPrints
server in public web archiving sites such as the Internet Archive.
It contacts an EPrints server, obtains the
list of documents it serves (optionally filtered based on such things as
modification date), determines the document URLs, extracts additional
URLs by scraping pages under the "/view" section of the public site, and
finally, sends the collected URLs to web archives.  Use-cases include
archiving a server's content ahead of migration to another system, and
preserving contents in independent third-party archives.
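
The steps described above (list the records, filter by modification date, derive the document URLs, add URLs found under "/view") can be sketched roughly as follows. This is not the eprints2archives code: collect_urls is a hypothetical helper, the records are passed in rather than fetched over the REST API, and the URL form base/id is an assumption.

```python
from datetime import date


def collect_urls(base_url, records, since=None, extra_urls=()):
    """Return the URLs to archive.

    records    -- iterable of (eprint_id, last_modified_date) pairs
    since      -- optional date; older records are skipped
    extra_urls -- additional URLs, e.g. scraped from /view pages
    """
    base = base_url.rstrip("/")
    urls = []
    for eprint_id, lastmod in records:
        if since is not None and lastmod < since:
            continue  # filtered out by modification date
        urls.append(f"{base}/{eprint_id}")  # assumed URL form
    for url in extra_urls:
        if url not in urls:  # avoid sending the same URL twice
            urls.append(url)
    return urls


if __name__ == "__main__":
    # Made-up records and host, for illustration only.
    records = [(1, date(2019, 5, 1)), (2, date(2020, 8, 30))]
    print(collect_urls("https://repo.example.org/", records,
                       since=date(2020, 1, 1),
                       extra_urls=["https://repo.example.org/view/year/2020.html"]))
```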

The program is written in Python 3 and works over a network using an
EPrints server's REST API and normal HTTP.  eprints2archives can work
with EPrints servers that require logins as well as those that allow
anonymous access.  It uses parallel threads by default, transparently
handles rate limits, and robustly deals with network errors.  Currently,
it can send contents to the Internet Archive and Archive.Today; more
destination archives may be added in the future.
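
As a generic illustration of sending URLs in parallel while backing off when a rate limit is hit, here is a minimal sketch. It is not taken from eprints2archives: the RateLimited exception, the send callable and both helper functions are hypothetical, and a real client would have to detect rate limiting from the archive's actual responses.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Hypothetical signal that the archive asked the client to slow down."""


def send_with_retry(send, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call send(url); on RateLimited, wait with exponential backoff and retry."""
    for attempt in range(retries + 1):
        try:
            return send(url)
        except RateLimited:
            if attempt == retries:
                raise  # give up after the last allowed retry
            sleep(delay * 2 ** attempt)


def send_all(send, urls, workers=4, **retry_options):
    """Send all URLs using a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(
            lambda u: send_with_retry(send, u, **retry_options), urls))
```

Here send would be whatever function submits one URL to an archive; passing it in as a parameter keeps the retry logic independent of any particular archive.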

You can install eprints2archives from PyPI or GitHub.  For more
information, please visit

 
https://eur03.safelinks.protection.outlook.com/?url="">

Please report problems using the issue tracking system, which you can
find at the GitHub link above.

Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka@caltech.edu --
https://eur03.safelinks.protection.outlook.com/?url="">
California Institute of Technology
*** Options:
http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive:
http://www.eprints.org/tech.php/
*** EPrints community wiki:
http://wiki.eprints.org/