<html><body>
<p><font size="2" face="sans-serif">Hi Michael,</font><br>
<br>
<font size="2" face="sans-serif">Thank you for this initiative.</font><br>
<br>
<font size="2" face="sans-serif">In what way is your application a replacement for the harvesting done by archive.org?</font><br>
<br>
<font size="2" face="sans-serif">We observe the </font><font size="2" face="Menlo-Regular">bot@archive.org</font><font size="2" face="sans-serif"> bot visiting our repository in waves, sometimes harvesting more than one million pages per month. The bot does not respect robots.txt (which in a default EPrints installation would block /cgi/ for bots) for various reasons (see </font><a href="https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/"><font size="2" face="sans-serif">https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/</font></a><font size="2" face="sans-serif">), so it also harvests data in all the various export plugin formats. We are not sure whether this is a good idea, because a website owner may have good reasons to protect certain parts of their site. But that is how it is with archive.org.</font><br>
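<br>
<font size="2" face="sans-serif">For comparison, this is roughly what a crawler that does honour robots.txt would check before fetching an export URL. It is only a sketch using Python's standard urllib.robotparser; the robots.txt content, the host repo.example.org, the bot name and the helper allowed() are assumptions made for illustration, not taken from a real EPrints installation.</font><br>
<br>
```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt, in the spirit of a default EPrints install that
# blocks /cgi/ for bots. The real file on a given server may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    base = "https://repo.example.org"  # hypothetical repository host
    for path in ("/cgi/export/eprint/1/BibTeX/example.bib", "/id/eprint/1/"):
        print(path, allowed("examplebot", base + path))
```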
<br>
<font size="2" face="sans-serif">From another perspective, we think that offering browse views under /view/* is outdated (it corresponds to the web of the 1990s). It puts strain on the server (the job that creates the views for our 400K-author list took more than 1.5 days, and the pages filled gigabytes of disk space) without much use for the end user (who drills through lists of 10K publications per year or 15K authors per letter of the alphabet?). It is also of limited use for bots: they just get x different routes to the same boring eprint, which generates unnecessary traffic that has to be filtered out of the statistics, and it creates a high potential for attacks by badly behaving bots. Offering a good sitemap.xml for bots, replacing lists with lookup (as we did for the authors), and faceted search provide a much better experience.</font><br>
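<br>
<font size="2" face="sans-serif">As an illustration of the sitemap.xml idea, a minimal sketch of generating one entry per eprint could look like this. It uses only Python's standard library and the public sitemaps.org namespace; the function build_sitemap(), the example URLs and the dates are assumptions, and this is not how EPrints itself produces its sitemap.</font><br>
<br>
```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a sitemap document from (url, lastmod) pairs.

    lastmod is a YYYY-MM-DD string, or None to omit the element.
    Returns the XML as a string (without an XML declaration).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical abstract pages; each eprint appears exactly once.
    print(build_sitemap([
        ("https://repo.example.org/id/eprint/1/", "2020-09-03"),
        ("https://repo.example.org/id/eprint/2/", None),
    ]))
```
<br>
<font size="2" face="sans-serif">Listing each abstract page once gives bots a single route to every eprint, instead of the many routes through the browse views.</font><br>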
<br>
<font size="2" face="sans-serif">Kind regards,</font><br>
<br>
<font size="2" face="sans-serif">Martin</font><br>
<br>
<br>
<font size="2" color="#424282" face="sans-serif">"Michael Hucka via Eprints-tech" ---03/09/2020 20:37:57---Greetings, eprints2archives is a new program to archive the web pages of an EPrints</font><br>
<br>
<font size="1" color="#5F5F5F" face="sans-serif">From:        </font><font size="1" face="sans-serif">"Michael Hucka via Eprints-tech" &lt;eprints-tech@ecs.soton.ac.uk&gt;</font><br>
<font size="1" color="#5F5F5F" face="sans-serif">To:        </font><font size="1" face="sans-serif">eprints-tech@ecs.soton.ac.uk</font><br>
<font size="1" color="#5F5F5F" face="sans-serif">Date:        </font><font size="1" face="sans-serif">03/09/2020 20:37</font><br>
<font size="1" color="#5F5F5F" face="sans-serif">Subject:        </font><font size="1" face="sans-serif">[EP-tech] Announcing eprints2archives</font><br>
<font size="1" color="#5F5F5F" face="sans-serif">Sent by:        </font><font size="1" face="sans-serif">&lt;eprints-tech-bounces@ecs.soton.ac.uk&gt;</font><br>
<hr width="100%" size="2" align="left" noshade style="color:#8091A5; "><br>
<br>
<br>
<tt><font size="2">Greetings,<br>
<br>
eprints2archives is a new program to archive the web pages of an EPrints <br>
server in public web archiving sites such as the Internet Archive <br>
(</font></tt><tt><font size="2"><a href="https://archive.org/web/">https://archive.org/web/</a></font></tt><tt><font size="2">). It contacts an EPrints server, obtains the <br>
list of documents it serves (optionally filtered based on such things as <br>
modification date), determines the document URLs, extracts additional <br>
URLs by scraping pages under the "/view" section of the public site, and <br>
finally, sends the collected URLs to web archives. Use-cases include <br>
archiving a server's content ahead of migration to another system, and <br>
preserving contents in independent third-party archives.<br>
<br>
The program is written in Python 3 and works over a network using an <br>
EPrints server's REST API and normal HTTP. eprints2archives can work <br>
with EPrints servers that require logins as well as those that allow <br>
anonymous access. It uses parallel threads by default, transparently <br>
handles rate limits, and robustly deals with network errors. Currently, <br>
it can send contents to the Internet Archive and Archive.Today; more <br>
destination archives may be added in the future.<br>
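<br>
The combination of parallel threads and rate-limit handling described above <br>
can be sketched generically as follows. This is not the actual <br>
eprints2archives code: the names RateLimited, send_with_retry and send_all, <br>
the injected submit function, and the exponential backoff policy are all <br>
assumptions chosen for illustration.<br>
<br>
```python
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimited(Exception):
    """Raised by a submit function when the archive asks us to slow down."""


def send_with_retry(submit, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call submit(url), backing off and retrying when rate limited."""
    for attempt in range(retries + 1):
        try:
            return submit(url)
        except RateLimited:
            if attempt == retries:
                raise  # give up after the last allowed retry
            sleep(delay * 2 ** attempt)  # assumed policy: exponential backoff


def send_all(submit, urls, workers=4, **kwargs):
    """Submit every URL using a pool of worker threads.

    Results are returned in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: send_with_retry(submit, u, **kwargs), urls))
```
<br>
Here submit stands for whatever function performs the real HTTP request to <br>
an archive; passing it in keeps the retry logic independent of any one service.<br>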
<br>
You can install eprints2archives from PyPI or GitHub. For more <br>
information, please visit<br>
<br>
</font></tt><tt><font size="2"><a href="https://github.com/caltechlibrary/eprints2archives">https://github.com/caltechlibrary/eprints2archives</a></font></tt><tt><font size="2"><br>
<br>
Please report problems using the issue tracking system, which you can <br>
find at the GitHub link above.<br>
<br>
Best regards,<br>
MH<br>
--<br>
Mike Hucka, Ph.D. -- mhucka@caltech.edu -- <br>
</font></tt><tt><font size="2"><a href="http://www.cds.caltech.edu/~mhucka">http://www.cds.caltech.edu/~mhucka</a></font></tt><tt><font size="2"><br>
California Institute of Technology<br>
</font></tt><br>
<br>
</body></html>