[EP-tech] Re: Announcing eprints2archives

Michael Hucka mhucka at library.caltech.edu
Mon Sep 7 19:51:11 BST 2020


Hi,

Thanks for your questions and comments.

> In what sort is your application a replacement for the harvesting by
> archive.org?

The README file in the section "Relationship to other similar tools" 
(https://github.com/caltechlibrary/eprints2archives#relationships-to-other-similar-tools) 
has some discussion of this, but basically, here is a summary of some 
similarities and differences:

  1. IA's regular crawlers and/or Archive-It service could be used to 
crawl an entire EPrints website, and with some work, could also be more 
selective in the URLs it captures.  By contrast, eprints2archives is 
focused on EPrint record (article) pages, and it offers simpler and more 
direct options to control what it harvests.

  2. IA's crawlers can't be told to do things like "save the pages of 
all records that have a last-modification date newer than ABC"; 
eprints2archives can.

  3. eprints2archives asks EPrints servers for the `official_url` field 
value (if the field exists in the records) and can send that URL to the 
archives as well, even though it may or may not be visible on the 
EPrints server's pages.

  4. You control eprints2archives' schedule directly (by deciding when 
to run it), whereas scheduling IA's services is more "fuzzy".  This may 
be useful, for example, if you want to have a regular process that runs 
eprints2archives with the --lastmod option to save modified records on a 
weekly basis.

  5. eprints2archives can send content to other archives besides IA.
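
To make point 4 concrete, here is a minimal sketch of how a weekly job might build the command line. Only the --lastmod option is taken from the discussion above; the --api-url option name, the ISO date format passed to --lastmod, and the server address are assumptions on my part that should be checked against the README before use.

```python
import datetime
import shlex


def weekly_lastmod_command(api_url, today=None, days=7):
    """Build an eprints2archives command line for records modified
    within the last `days` days (a sketch, not a tested recipe)."""
    today = today or datetime.date.today()
    since = today - datetime.timedelta(days=days)
    # --lastmod is the option mentioned above. The --api-url option name
    # and the ISO date format are ASSUMPTIONS; verify them in the README.
    return [
        "eprints2archives",
        "--api-url", api_url,
        "--lastmod", since.isoformat(),
    ]


if __name__ == "__main__":
    # Hypothetical server address, for illustration only.
    cmd = weekly_lastmod_command("https://eprints.example.org/rest")
    print(shlex.join(cmd))
```

A scheduler such as cron could then run the printed command once a week; the function only builds the argument list and does not invoke the program itself.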

> We observe the bot at archive.org bot visiting in waves our repo, sometimes
> harvesting more than one million pages per month. The bot does not respect
> robots.txt (which in a default EPrints installation would block /cgi/ to
> bots) due to various reasons (see
> https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/
> ), so also harvesting data in all the various export plugin formats. We
> are not sure whether this is a good idea, because a website owner will have
> good reasons to protect certain parts of his site. But it is as it is with
> archive.org.

I'm not sure of the sense in which "protect" is intended in the 
paragraph above.  Let me say that although eprints2archives gets data 
from an EPrints server directly, the visibility of the URLs it sends to 
web archives is entirely dependent on the public visibility of the 
pages.  In other words, the pages archived by IA via eprints2archives 
can only be the pages that IA can actually see.  If a site owner wants 
to protect something, hopefully they do so by not making the pages 
publicly visible in the first place?

> On another perspective, we think that offering browse views /view/* is
> outdated (corresponds to the web of the 90ies), just generates strain on
> the server (the job for creating the views for our 400K author list took > 1.5
> days, the pages filled GBs of disk space) without much use for the
> end user (who drills through lists of either 10K publications per year or
> 15K authors per letter in the alphabet?), with limited use for bots - they
> get just x variants to get to the same boring eprint and so generate
> unnecessary traffic which has to filtered out for statistics - and creates
> a high potential for attacks by bad behaving bots. Offering a good
> sitemap.xml for bots, replacing lists with lookup (we did so for the
> authors), and facetted search provide a much improved experience.

Yeah, it's true that there are a lot of variant URLs being gathered up 
by eprints2archives.  (At least 3 for every record -- cf. 
https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records)

In our case, we found that IA's coverage was quite incomplete, and in 
addition, we are working on migrating to a different presentation 
system; for these reasons, we felt it would be a good idea to capture 
the current versions of our EPrints sites as completely as possible.

However, I would welcome some guidance about this.  In the case of 
Caltech's EPrints servers, we have /view pages, and clicking on the 
links under "browse" on the front page sends the user to pages under 
/view/, so I included them in what eprints2archives gathers.  Maybe 
this is too much for most situations.  If people would like to suggest 
refinements to the approach described in the section at 
https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records 
I will take them into consideration.
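
As a starting point for that discussion, one possible refinement would be to drop the browse-view pages from the list of URLs before it is sent to the archives. The sketch below only illustrates the idea: the function name `drop_browse_views` and the example server address are hypothetical, and nothing like this is claimed to exist in eprints2archives today.

```python
from urllib.parse import urlparse


def drop_browse_views(urls, prefix="/view/"):
    """Return the URLs whose path is not under the browse-view prefix.

    Hypothetical helper for illustration; not part of eprints2archives.
    """
    kept = []
    for url in urls:
        path = urlparse(url).path
        # Skip the browse root itself ("/view") and anything below it.
        if path == prefix.rstrip("/") or path.startswith(prefix):
            continue
        kept.append(url)
    return kept


if __name__ == "__main__":
    # Hypothetical URLs on a hypothetical server.
    candidates = [
        "https://eprints.example.org/view/year/2020.html",
        "https://eprints.example.org/1234/",
        "https://eprints.example.org/id/eprint/1234",
    ]
    for url in drop_browse_views(candidates):
        print(url)
```

Matching on the parsed path rather than on the raw string avoids accidentally dropping a record whose query string or fragment happens to contain "/view/".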

Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka at caltech.edu -- 
http://www.cds.caltech.edu/~mhucka
California Institute of Technology


