EPrints Technical Mailing List Archive

Message: #08314



Re: [EP-tech] Antwort: Announcing eprints2archives


Hi,

Thanks for your questions and comments.

> In what sense is your application a replacement for the harvesting by
> archive.org?

The README file, in the section "Relationship to other similar tools" (https://github.com/caltechlibrary/eprints2archives#relationships-to-other-similar-tools), has some discussion of this, but basically, here is a summary of some similarities and differences:

1. IA's regular crawlers and/or Archive-It service could be used to crawl an entire EPrints website, and with some work, could also be more selective in the URLs it captures. By contrast, eprints2archives is focused on EPrint record (article) pages, and it offers simpler and more direct options to control what it harvests.

2. IA's crawlers can't be told to do things like "save the pages of all records that have a last-modification date newer than ABC"; eprints2archives can.

3. Eprints2archives asks EPrints servers for the `official_url` field value (if the field exists in the records), which may or may not be visible on the EPrints server's pages.

4. You control eprints2archives' schedule directly (by deciding when to run it), whereas scheduling IA's services is more "fuzzy". This may be useful, for example, if you want a regular process that runs eprints2archives with the --lastmod option to save modified records on a weekly basis (see the sketch after this list).

5. eprints2archives can send content to other archives besides IA.
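
As a concrete illustration of point 4, a weekly wrapper could look roughly like the sketch below. Only --lastmod comes from the discussion above; the server argument and everything else (including whether the server URL is passed positionally) are assumptions on my part, so please check eprints2archives --help for the real command-line syntax.

    # Rough sketch of a weekly "save recently modified records" run.
    # Only --lastmod is taken from the discussion above; the positional
    # server URL and the rest are assumptions; see `eprints2archives --help`.
    import subprocess
    from datetime import date, timedelta

    one_week_ago = (date.today() - timedelta(days=7)).isoformat()

    subprocess.run(
        ["eprints2archives",
         "--lastmod", one_week_ago,       # only records modified since this date
         "https://eprints.example.edu"],  # hypothetical EPrints server
        check=True,
    )

Run from cron (or any other scheduler) once a week, that gives you the kind of direct control over timing that point 4 describes.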

> We observe the bot@archive.org bot visiting our repo in waves, sometimes
> harvesting more than one million pages per month. The bot does not respect
> robots.txt (which in a default EPrints installation would block /cgi/ to
> bots) for various reasons (see
> https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/),
> so it also harvests data in all the various export plugin formats. We are
> not sure whether this is a good idea, because a website owner will have
> good reasons to protect certain parts of their site. But that is how it is
> with archive.org.

I'm not sure of the sense in which "protect" is intended in the paragraph above. Let me say that although eprints2archives gets data from an EPrints server directly, the visibility of the URLs it sends to web archives is entirely dependent on the public visibility of the pages. In other words, the pages archived by IA via eprints2archives can only be the pages that IA can actually see. If a site owner wants to protect something, hopefully they do so by not making the pages publicly visible in the first place?
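
To put that more concretely, the constraint amounts to something like the following check. This is a minimal sketch in Python, not code from eprints2archives, and the function name is mine:

    # A page can only end up in a web archive if an anonymous visitor (which
    # is essentially what a web archive's crawler is) can fetch it.
    import urllib.error
    import urllib.request

    def publicly_visible(url, timeout=30):
        """Return True if an anonymous GET of `url` succeeds with HTTP 200."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except urllib.error.URLError:
            return False

A record page that sits behind a login, or is otherwise hidden from anonymous visitors, fails that kind of check and consequently cannot be captured via eprints2archives either.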

> From another perspective, we think that offering the browse views under
> /view/* is outdated (it corresponds to the web of the '90s). It just puts
> strain on the server (the job that builds the views for our 400K author
> list took 1.5 days, and the pages filled gigabytes of disk space), offers
> little to the end user (who drills through lists of either 10K publications
> per year or 15K authors per letter of the alphabet?), is of limited use to
> bots (they just get several variants of the same boring eprint, generating
> unnecessary traffic that has to be filtered out of the statistics), and
> creates a high potential for attacks by badly behaved bots. Offering a good
> sitemap.xml for bots, replacing lists with lookups (we did so for the
> authors), and faceted search provide a much improved experience.
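
On the sitemap.xml suggestion specifically: I agree that a record-level sitemap is a friendlier way to guide crawlers than browse lists. Purely for illustration (the example URL and date below are made up), such a file only needs each record page and its last-modification date:

    # Illustrative generator for a record-level sitemap.xml.
    # `entries` is an iterable of (url, lastmod_iso_date) pairs.
    from xml.sax.saxutils import escape

    def sitemap(entries):
        lines = ['<?xml version="1.0" encoding="UTF-8"?>',
                 '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
        for url, lastmod in entries:
            lines += ['  <url>',
                      f'    <loc>{escape(url)}</loc>',
                      f'    <lastmod>{lastmod}</lastmod>',
                      '  </url>']
        lines.append('</urlset>')
        return '\n'.join(lines)

    print(sitemap([("https://eprints.example.edu/id/eprint/12345", "2020-08-01")]))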

Yeah, it's true that there are a lot of variant URLs being gathered up by eprints2archives. (At least 3 for every record; cf. https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records)
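
To make "at least 3" concrete, the per-record URLs involved are roughly along the lines of the sketch below. This is illustrative only (the README section linked above is the authoritative description): the two path patterns are the usual EPrints abstract-page forms, and official_url is included only when that field exists in the record.

    # Sketch of the kinds of per-record URLs involved; see the README section
    # linked above for the authoritative list.
    def record_url_variants(server, eprint_id, official_url=None):
        urls = [
            f"{server}/{eprint_id}",            # short abstract-page URL
            f"{server}/id/eprint/{eprint_id}",  # canonical /id/eprint/ form
        ]
        if official_url:                        # official_url field value, if present
            urls.append(official_url)
        return urls

    print(record_url_variants("https://eprints.example.edu", 12345))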

In our case, we found that IA's coverage was quite incomplete, and in addition, we are working on migrating to a different presentation system; for these reasons, we felt it would be a good idea to capture the current versions of our EPrints sites as completely as possible.

However, I would welcome some guidance about this. In the case of Caltech's EPrints servers, we have /view pages, and clicking on the links under "browse" on the front page sends the user to pages under /view/, so I included them in what eprints2archives gathers. Maybe this is too much for most situations. If people would like to suggest refinements to the approach described in the section at https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records, I will take them into consideration.

Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka@caltech.edu -- http://www.cds.caltech.edu/~mhucka
California Institute of Technology