EPrints Technical Mailing List Archive

Message: #08311



[EP-tech] Announcing eprints2archives


Greetings,

eprints2archives is a new program to archive the web pages of an EPrints server in public web archiving sites such as the Internet Archive (https://archive.org/web/). It contacts an EPrints server, obtains the list of documents it serves (optionally filtered by criteria such as modification date), determines the document URLs, extracts additional URLs by scraping pages under the "/view" section of the public site, and finally sends the collected URLs to web archives. Use cases include archiving server content ahead of a migration to another system and preserving content in independent third-party archives.
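
To make the sequence of steps above concrete, here is a minimal sketch in Python. It is not taken from the eprints2archives source. The function names, the record URL pattern (server address followed by the record id), and the use of the Wayback Machine's "save" URL prefix are all assumptions made for illustration, and the sketch only builds URLs without contacting any server.

```python
from datetime import date

# Assumption: prefixing a page URL with this address asks the Wayback
# Machine to capture it. The real program may submit pages differently.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def filter_by_lastmod(records, since):
    """Keep the ids of records modified on or after `since`.

    `records` is a list of (record_id, last_modified_date) pairs, a
    hypothetical stand-in for what a server listing might provide.
    """
    return [rid for rid, lastmod in records if lastmod >= since]


def record_urls(base_url, record_ids):
    """Build one public page URL per record id.

    Assumption: record pages live at <base_url>/<id>.
    """
    base = base_url.rstrip("/")
    return [f"{base}/{rid}" for rid in record_ids]


def wayback_save_url(page_url):
    """Return the URL that would request a capture of `page_url`."""
    return WAYBACK_SAVE_PREFIX + page_url


def plan_submissions(base_url, records, since):
    """Chain the steps: filter records, build URLs, build save requests."""
    ids = filter_by_lastmod(records, since)
    return [wayback_save_url(u) for u in record_urls(base_url, ids)]
```

For example, calling plan_submissions("https://eprints.example.org/", records, date(2020, 6, 1)) returns one save URL for each record changed since 1 June 2020; eprints.example.org is a placeholder host.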

The program is written in Python 3 and works over a network using an EPrints server's REST API and normal HTTP. eprints2archives can work with EPrints servers that require logins as well as those that allow anonymous access. It uses parallel threads by default, transparently handles rate limits, and robustly deals with network errors. Currently, it can send contents to the Internet Archive and Archive.Today; more destination archives may be added in the future.
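
As a rough illustration of what sending URLs with parallel threads while handling rate limits could look like, here is a short Python sketch. It is not the program's actual implementation: the RateLimited exception, the send callable, and the retry parameters are hypothetical, and exponential backoff is just one common way to respond to a rate limit.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Hypothetical signal that an archive asked the client to slow down."""


def send_with_retry(send, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call send(url), waiting and retrying when it raises RateLimited.

    The wait doubles after each failed attempt (delay, 2*delay, ...).
    After `retries` failed retries, the exception is re-raised.
    """
    for attempt in range(retries + 1):
        try:
            return send(url)
        except RateLimited:
            if attempt == retries:
                raise
            sleep(delay * 2 ** attempt)


def send_all(send, urls, threads=4, **retry_options):
    """Send every URL using a pool of worker threads.

    Results come back in the same order as `urls`.
    """
    def worker(url):
        return send_with_retry(send, url, **retry_options)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, urls))
```

Here send would be whatever function performs the real HTTP request to an archive and raises RateLimited when the service asks the client to wait; passing it in as a parameter keeps the retry logic independent of any particular archive.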

You can install eprints2archives from PyPI or GitHub. For more information, please visit

  https://github.com/caltechlibrary/eprints2archives

Please report problems using the issue tracking system, which you can find at the GitHub link above.

Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka@caltech.edu -- http://www.cds.caltech.edu/~mhucka
California Institute of Technology