EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #08312

[EP-tech] Re: Announcing eprints2archives

To: <eprints-tech@ecs.soton.ac.uk>, Michael Hucka <mhucka@library.caltech.edu>
Subject: [EP-tech] Re: Announcing eprints2archives
From: <martin.braendle@uzh.ch>
Date: Fri, 4 Sep 2020 15:04:33 +0200

Hi Michael,

Thank you for this initiative.

In what way is your application a replacement for the harvesting done by archive.org?

We observe the bot@archive.org bot visiting our repo in waves, sometimes harvesting more than one million pages per month. The bot does not respect robots.txt (which in a default EPrints installation would block /cgi/ to bots) for various reasons (see https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/ ), so it also harvests data in all the various export plugin formats. We are not sure whether this is a good idea, because a website owner will have good reasons to protect certain parts of their site. But that is how it is with archive.org.
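To illustrate what respecting such a rule would mean for a crawler, here is a minimal Python sketch using the standard library's robots.txt parser. The robots.txt content, the host name repo.example.org, the bot name, and the export path are assumptions made for illustration; only the /cgi/ rule itself comes from the description above.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: only the /cgi/ rule is taken from the message;
# a real EPrints installation may ship additional rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robots.txt-respecting crawler may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "https://repo.example.org"  # hypothetical repository host
    # Hypothetical paths: one export plugin URL under /cgi/, one abstract page.
    for path in ("/cgi/export/eprint/123/BibTeX/example.bib", "/123/"):
        print(path, allowed("examplebot", base + path))
```

A crawler that ignores robots.txt simply skips this check, which is why the export URLs under /cgi/ end up being harvested as well.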

From another perspective, we think that offering browse views (/view/*) is outdated (it corresponds to the web of the 1990s). The views put strain on the server (the job for creating the views for our 400K author list took more than 1.5 days, and the pages filled GBs of disk space) without much use for the end user (who drills through lists of either 10K publications per year or 15K authors per letter of the alphabet?). They are of limited use for bots, which just get x different routes to the same boring eprint and so generate unnecessary traffic that has to be filtered out for statistics, and they create a high potential for attacks by badly behaved bots. Offering a good sitemap.xml for bots, replacing lists with lookup (we did so for the authors), and faceted search provide a much improved experience.
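As a rough illustration of the sitemap.xml idea, the following Python sketch builds a sitemap with one entry per eprint, so that a bot gets exactly one URL per record instead of many browse-view routes. The function name, the example URLs, and the dates are hypothetical; how the list of eprints is obtained from the repository is left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap document from (url, lastmod) pairs.

    entries: iterable of (url, last-modified date as YYYY-MM-DD) tuples.
    Returns the XML as a string (without an XML declaration).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, lastmod in entries:
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = url
        ET.SubElement(node, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical eprint abstract pages on a hypothetical host.
    eprints = [
        ("https://repo.example.org/101/", "2020-08-30"),
        ("https://repo.example.org/102/", "2020-09-01"),
    ]
    print(build_sitemap(eprints))
```

The sitemap protocol limits a single file to 50,000 URLs, so a large repository would split its entries across several files and list them in a sitemap index.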

Kind regards,

Martin

From: "Michael Hucka via Eprints-tech" <eprints-tech@ecs.soton.ac.uk>
To: eprints-tech@ecs.soton.ac.uk
Date: 03/09/2020 20:37
Subject: [EP-tech] Announcing eprints2archives
Sent by: <eprints-tech-bounces@ecs.soton.ac.uk>

Greetings,

eprints2archives is a new program to archive the web pages of an EPrints server in public web archiving sites such as the Internet Archive. It contacts an EPrints server, obtains the list of documents it serves (optionally filtered based on such things as modification date), determines the document URLs, extracts additional URLs by scraping pages under the "/view" section of the public site, and finally, sends the collected URLs to web archives. Use cases include archiving a server's content ahead of migration to another system, and preserving contents in independent third-party archives.

The program is written in Python 3 and works over a network using an EPrints server's REST API and normal HTTP. eprints2archives can work with EPrints servers that require logins as well as those that allow anonymous access. It uses parallel threads by default, transparently handles rate limits, and robustly deals with network errors. Currently, it can send contents to the Internet Archive and Archive.Today; more destination archives may be added in the future.

You can install eprints2archives from PyPI or GitHub. For more information, please visit the project's page on GitHub. Please report problems using the issue tracking system, which you can find on the same GitHub page.

Best regards,
MH

--
Mike Hucka, Ph.D. -- mhucka@caltech.edu
California Institute of Technology

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
