[EP-tech] Announcing eprints2archives
eprints2archives is a new program to archive the web pages of an EPrints
server in public web archiving sites such as the Internet Archive
(https://archive.org/web/). It contacts an EPrints server, obtains the
list of documents it serves (optionally filtered based on such things as
modification date), determines the document URLs, extracts additional
URLs by scraping pages under the "/view" section of the public site, and
finally sends the collected URLs to web archives. Use cases include
archiving a server's content ahead of migration to another system and
preserving content in independent third-party archives.
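As a rough illustration of the last two steps (determining page URLs and
handing them to an archive), here is a minimal Python sketch. It is not
taken from eprints2archives itself. The function names, the example
server address, and the "id/eprint/<N>" page pattern are assumptions made
for illustration; the "https://web.archive.org/save/" prefix is the
Internet Archive's public save-page address.

```python
from urllib.parse import urljoin


def eprint_page_urls(base_url, eprint_ids):
    """Return candidate public page URLs for the given record numbers.

    The "id/eprint/<N>" pattern is an assumption for illustration;
    a real server may expose its records under different paths.
    """
    base = base_url.rstrip("/") + "/"
    return [urljoin(base, f"id/eprint/{eid}") for eid in eprint_ids]


def unique(urls):
    """Drop duplicate URLs while keeping the original order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


def wayback_save_url(url):
    """Build the Wayback Machine address that requests a capture of url."""
    return "https://web.archive.org/save/" + url


# Hypothetical server and record numbers, used only as sample input.
pages = unique(eprint_page_urls("https://eprints.example.org", [1, 2, 2]))
requests_to_make = [wayback_save_url(u) for u in pages]
```

The sketch only builds the request addresses; actually fetching them,
and pacing those fetches, is a separate concern.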
The program is written in Python 3 and works over a network using an
EPrints server's REST API and normal HTTP. eprints2archives can work
with EPrints servers that require logins as well as those that allow
anonymous access. It uses parallel threads by default, transparently
handles rate limits, and robustly deals with network errors. Currently,
it can send contents to the Internet Archive and Archive.Today; more
destination archives may be added in the future.
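To give a feel for what "parallel threads" plus rate-limit and
network-error handling can look like, here is a small Python sketch
using only the standard library. It is a generic pattern, not the
program's actual code: RateLimited, send_with_retry, and send_all are
hypothetical names, and the caller supplies the function that does the
real sending.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Raised by a sender when the archive asks the client to slow down."""


def send_with_retry(send, url, max_tries=4, base_delay=1.0, sleep=time.sleep):
    """Call send(url), retrying with exponential backoff on transient errors."""
    for attempt in range(max_tries):
        try:
            return send(url)
        except (RateLimited, ConnectionError):
            if attempt == max_tries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # waits 1, 2, 4, ... times base_delay


def send_all(send, urls, threads=4, **retry_options):
    """Send every URL using a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(
            pool.map(lambda u: send_with_retry(send, u, **retry_options), urls)
        )
```

Passing the sender in as an argument keeps the retry and threading logic
independent of any particular archive, and lets it be exercised without
touching the network.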
You can install eprints2archives from PyPI or GitHub. For more
information, please visit
Please report problems using the issue tracking system, which you can
find at the GitHub link above.
Mike Hucka, Ph.D. -- mhucka at caltech.edu --
California Institute of Technology