[EP-tech] Announcing eprints2archives

Michael Hucka mhucka at library.caltech.edu
Thu Sep 3 19:35:57 BST 2020


Greetings,

eprints2archives is a new program to archive the web pages of an EPrints 
server in public web archiving sites such as the Internet Archive 
(https://archive.org/web/).  It contacts an EPrints server, obtains the 
list of documents it serves (optionally filtered based on such things as 
modification date), determines the document URLs, extracts additional 
URLs by scraping pages under the "/view" section of the public site, and 
finally, sends the collected URLs to web archives.  Use-cases include 
archiving a server's content ahead of migration to another system, and 
preserving contents in independent third-party archives.
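
As a rough illustration of the steps described above, here is a minimal 
Python sketch.  It is not taken from eprints2archives itself.  The record 
layout, the function names, and the example.org server are all 
hypothetical, and the sketch only builds URLs without contacting any 
server.  The "https://web.archive.org/save/" prefix is the Internet 
Archive's public "Save Page Now" address.

```python
from datetime import date
from html.parser import HTMLParser

# Hypothetical record layout; real EPrints records carry many more fields.
RECORDS = [
    {"id": 1, "lastmod": date(2019, 5, 1)},
    {"id": 2, "lastmod": date(2020, 8, 15)},
]


def record_urls(base_url, records, since=None):
    """Return page URLs for records modified on or after `since`."""
    selected = [r for r in records if since is None or r["lastmod"] >= since]
    return [base_url.rstrip("/") + "/" + str(r["id"]) for r in selected]


class ViewLinkParser(HTMLParser):
    """Collect href values that point under the "/view" section of a site."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith("/view"):
                self.links.append(href)


def save_request_url(page_url):
    """Build a Wayback Machine "Save Page Now" request URL for page_url."""
    return "https://web.archive.org/save/" + page_url
```

For example, record_urls("https://eprints.example.org", RECORDS, 
since=date(2020, 1, 1)) keeps only the second record, and each resulting 
URL can then be passed to save_request_url().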

The program is written in Python 3 and works over a network using an 
EPrints server's REST API and normal HTTP.  eprints2archives can work 
with EPrints servers that require logins as well as those that allow 
anonymous access.  It uses parallel threads by default, transparently 
handles rate limits, and robustly deals with network errors.  Currently, 
it can send contents to the Internet Archive and Archive.Today; more 
destination archives may be added in the future.
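
To give a feel for what parallel sending with rate-limit and error 
handling can look like, here is a small Python sketch.  It is an 
assumption-laden illustration and not the program's actual code: the 
RateLimited exception, the function names, and the retry settings are 
all hypothetical, and `send` stands for whatever function submits one 
URL to an archive.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Hypothetical signal that an archive asked the client to slow down."""


def send_with_retry(send, url, retries=3, delay=0.01):
    """Call send(url), backing off and retrying on rate limits or network errors."""
    for attempt in range(retries):
        try:
            return send(url)
        except (RateLimited, ConnectionError):
            if attempt == retries - 1:
                raise  # give up after the last allowed attempt
            time.sleep(delay * 2 ** attempt)  # exponential backoff


def send_all(send, urls, threads=4):
    """Send every URL using a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda u: send_with_retry(send, u), urls))
```

The backoff delay here is deliberately tiny so the sketch runs quickly; 
a real client would wait much longer between attempts.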

You can install eprints2archives from PyPI or GitHub.  For more 
information, please visit

   https://github.com/caltechlibrary/eprints2archives

Please report problems using the issue tracking system, which you can 
find at the GitHub link above.

Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka at caltech.edu -- 
http://www.cds.caltech.edu/~mhucka
California Institute of Technology

