EPrints Technical Mailing List Archive

Message: #08314



Re: [EP-tech] Antwort: Announcing eprints2archives


Hi,

Thanks for your questions and comments.

> In what sense is your application a replacement for the harvesting by
> archive.org?

The README file, in the section "Relationship to other similar tools" (https://github.com/caltechlibrary/eprints2archives#relationships-to-other-similar-tools), has some discussion of this, but basically, here is a summary of some similarities and differences:

1. IA's regular crawlers and/or Archive-It service could be used to crawl an entire EPrints website, and with some work, could also be more selective in the URLs it captures. By contrast, eprints2archives is focused on EPrint record (article) pages, and it offers simpler and more direct options to control what it harvests.

2. IA's crawlers can't be told to do things like "save the pages of all records that have a last-modification date newer than ABC"; eprints2archives can.

3. Eprints2archives asks EPrints servers for the `official_url` field value (if the field exists in the records), which may or may not be visible on the EPrints server's pages.

4. You control eprints2archives' schedule directly (by deciding when to run it), whereas scheduling IA's services is more "fuzzy". This may be useful, for example, if you want a regular process that runs eprints2archives with the --lastmod option to save modified records on a weekly basis (see the sketch after this list).

5. eprints2archives can send content to other archives besides IA.
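
As a concrete illustration of point 4, a weekly wrapper could look roughly like the sketch below. Only --lastmod comes from the discussion above; the server argument and everything else (including whether the server URL is passed positionally) are assumptions on my part, so please check eprints2archives --help for the real command-line syntax.

    # Rough sketch of a weekly "save recently modified records" run.
    # Only --lastmod is taken from the discussion above; the positional
    # server URL and the rest are assumptions; see `eprints2archives --help`.
    import subprocess
    from datetime import date, timedelta

    one_week_ago = (date.today() - timedelta(days=7)).isoformat()

    subprocess.run(
        ["eprints2archives",
         "--lastmod", one_week_ago,       # only records modified since this date
         "https://eprints.example.edu"],  # hypothetical EPrints server
        check=True,
    )

Run from cron (or any other scheduler) once a week, that gives you the kind of direct control over timing that point 4 describes.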

> We observe the bot@archive.org bot visiting our repo in waves, sometimes
> harvesting more than one million pages per month. The bot does not respect
> robots.txt (which in a default EPrints installation would block /cgi/ to
> bots) for various reasons (see
> https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/),
> so it also harvests data in all the various export plugin formats. We are
> not sure whether this is a good idea, because a website owner will have
> good reasons to protect certain parts of their site. But that is how it is
> with archive.org.

I'm not sure of the sense in which "protect" is intended in the paragraph above. Let me say that although eprints2archives gets data from an EPrints server directly, the visibility of the URLs it sends to web archives is entirely dependent on the public visibility of the pages. In other words, the pages archived by IA via eprints2archives can only be the pages that IA can actually see. If a site owner wants to protect something, hopefully they do so by not making the pages publicly visible in the first place?
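
To put that more concretely, the constraint amounts to something like the following check. This is a minimal sketch in Python, not code from eprints2archives, and the function name is mine:

    # A page can only end up in a web archive if an anonymous visitor (which
    # is essentially what a web archive's crawler is) can fetch it.
    import urllib.error
    import urllib.request

    def publicly_visible(url, timeout=30):
        """Return True if an anonymous GET of `url` succeeds with HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except urllib.error.URLError:
            return False

A record page that sits behind a login, or is otherwise hidden from anonymous visitors, fails that kind of check and consequently cannot be captured via eprints2archives either.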

> From another perspective, we think that offering the browse views under
> /view/* is outdated (it corresponds to the web of the '90s). It just puts
> strain on the server (the job that builds the views for our 400K author
> list took 1.5 days, and the pages filled gigabytes of disk space), offers
> little to the end user (who drills through lists of either 10K publications
> per year or 15K authors per letter of the alphabet?), is of limited use to
> bots (they just get several variants of the same boring eprint, generating
> unnecessary traffic that has to be filtered out of the statistics), and
> creates a high potential for attacks by badly behaved bots. Offering a good
> sitemap.xml for bots, replacing lists with lookups (we did so for the
> authors), and faceted search provide a much improved experience.
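
On the sitemap.xml suggestion specifically: I agree that a record-level sitemap is a friendlier way to guide crawlers than browse lists. Purely for illustration (the example URL and date below are made up), such a file only needs each record page and its last-modification date:

    # Illustrative generator for a record-level sitemap.xml.
    # `entries` is an iterable of (url, lastmod_iso_date) pairs.
    from xml.sax.saxutils import escape

    def sitemap(entries):
        lines = ['<?xml version="1.0" encoding="UTF-8"?>',
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for url, lastmod in entries:
            lines += ['  <url>',
                      f'    <loc>{escape(url)}</loc>',
                      f'    <lastmod>{lastmod}</lastmod>',
                      '  </url>']
        lines.append('</urlset>')
        return '\n'.join(lines)

    print(sitemap([("https://eprints.example.edu/id/eprint/12345", "2020-08-01")]))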

Yeah, it's true that there are a lot of variant URLs being gathered up by eprints2archives. (At least 3 for every record; cf. https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records)
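
To make "at least 3" concrete, the per-record URLs involved are roughly along the lines of the sketch below. This is illustrative only (the README section linked above is the authoritative description): the two path patterns are the usual EPrints abstract-page forms, and official_url is included only when that field exists in the record.

    # Sketch of the kinds of per-record URLs involved; see the README section
    # linked above for the authoritative list.
    def record_url_variants(server, eprint_id, official_url=None):
        urls = [
            f"{server}/{eprint_id}",            # short abstract-page URL
            f"{server}/id/eprint/{eprint_id}",  # canonical /id/eprint/ form
        ]
        if official_url:                        # official_url field value, if present
            urls.append(official_url)
        return urls

    print(record_url_variants("https://eprints.example.edu", 12345))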

In our case, we found that IA's coverage was quite incomplete, and in addition, we are working on migrating to a different presentation system; for these reasons, we felt it would be a good idea to capture the current versions of our EPrints sites as completely as possible.

However, I would welcome some guidance about this. In the case of Caltech's EPrints servers, we have /view pages, and clicking on the links under "browse" on the front page sends the user to pages under /view/, so I included them in what eprints2archives gathers. Maybe this is too much for most situations. If people would like to suggest refinements to the approach described in the section at https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records, I will take them into consideration.

Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka@caltech.edu -- http://www.cds.caltech.edu/~mhucka
California Institute of Technology