[EP-tech] Re: Announcing eprints2archives
Hi,
Thanks for your questions and comments.
> In what way is your application a replacement for the harvesting done by
> archive.org?
The README file in the section "Relationship to other similar tools"
(https://github.com/caltechlibrary/eprints2archives#relationships-to-other-similar-tools)
has some discussion of this, but basically, here is a summary of some
similarities and differences:
1. IA's regular crawlers and/or Archive-It service could be used to
crawl an entire EPrints website and, with some work, could also be made
more selective about the URLs they capture. By contrast, eprints2archives
is focused on EPrints record (article) pages, and it offers simpler and
more direct options to control what it harvests.
2. IA's crawlers can't be told to do things like "save the pages of
all records that have a last-modification date newer than ABC";
eprints2archives can.
3. Eprints2archives asks EPrints servers for the `official_url` field
value (if the field exists in the records), which may or may not be
visible on the EPrints server's pages.
4. You control eprints2archives' schedule directly (by deciding when
to run it), whereas scheduling IA's services is more "fuzzy". This may
be useful, for example, if you want a regular process that runs
eprints2archives with the --lastmod option to save modified records on a
weekly basis (a rough sketch of such a job follows this list).
5. eprints2archives can send content to other archives besides IA.
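To illustrate point 4 above, here is a minimal, untested sketch of such a
weekly job. Only the --lastmod option comes from the discussion above; the
server URL is a placeholder, and you should check "eprints2archives --help"
for the exact argument forms your installed version accepts.

    # Weekly job: ask eprints2archives to save only recently modified records.
    # The server URL is hypothetical and the ISO date format for --lastmod is
    # an assumption -- confirm both against "eprints2archives --help".
    import subprocess
    from datetime import date, timedelta

    one_week_ago = date.today() - timedelta(days=7)

    subprocess.run(
        ["eprints2archives",
         "--lastmod", one_week_ago.isoformat(),  # e.g. "2020-09-01"
         "https://eprints.example.edu"],         # placeholder EPrints server
        check=True,
    )

Run on a schedule (e.g., from cron), that gives you the kind of regular,
incremental archiving described in point 4.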
> We observe the archive.org bot visiting our repo in waves, sometimes
> harvesting more than one million pages per month. The bot does not
> respect robots.txt (which in a default EPrints installation would block
> /cgi/ to bots) for various reasons (see
> https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-engines-dont-work-well-for-web-archives/),
> and so it also harvests data in all the various export plugin formats.
> We are not sure whether this is a good idea, because a website owner
> will have good reasons to protect certain parts of their site. But it
> is as it is with archive.org.
I'm not sure of the sense in which "protect" is intended in the
paragraph above. Let me say that although eprints2archives gets data
from an EPrints server directly, the visibility of the URLs it sends to
web archives is entirely dependent on the public visibility of the
pages. In other words, the pages archived by IA via eprints2archives
can only be the pages that IA can actually see. If a site owner wants
to protect something, hopefully they do so by not making the pages
publicly visible in the first place?
> From another perspective, we think that offering browse views (/view/*)
> is outdated (it corresponds to the web of the 1990s). It just generates
> strain on the server (the job for creating the views for our 400K author
> list took >1.5 days, and the pages filled GBs of disk space) without
> much use for the end user (who drills through lists of either 10K
> publications per year or 15K authors per letter of the alphabet?), with
> limited use for bots - they just get x variants to get to the same
> boring eprint and so generate unnecessary traffic which has to be
> filtered out for statistics - and it creates a high potential for
> attacks by badly behaved bots. Offering a good sitemap.xml for bots,
> replacing lists with lookup (we did so for the authors), and faceted
> search provide a much improved experience.
Yeah, it's true that there are a lot of variant URLs being gathered up
by eprints2archives. (At least 3 for every record -- cf.
https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records)
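For a rough idea of what those variants look like (the exact forms are
described in the README section linked above; the server name and record
number here are made up):

    # Illustrative only: the kinds of URL variants that can point at a
    # single EPrints record.  Server name and record number are hypothetical.
    server = "https://eprints.example.edu"
    record_id = 1234

    candidate_urls = [
        f"{server}/{record_id}",            # short record URL
        f"{server}/id/eprint/{record_id}",  # /id/eprint/ form
        # ... plus the record's official_url value (if set), which may point
        # somewhere else entirely, such as a DOI or a publisher page.
    ]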
In our case, we found that IA's coverage was quite incomplete, and in
addition, we are working on migrating to a different presentation
system; for these reasons, we felt it would be a good idea to capture
the current versions of our EPrints sites as completely as possible.
However, I would welcome some guidance about this. In the case of
Caltech's EPrints servers, we have /view pages, and clicking on the
links under "browse" on the front page sends the user to pages under
/view/, so I included them in what eprints2archives gathers. Maybe
this is too much for most situations. If people would like to suggest
refinements to the approach described in the section at
https://github.com/caltechlibrary/eprints2archives#urls-for-individual-eprints-records
then I will take them into consideration.
Best regards,
MH
--
Mike Hucka, Ph.D. -- mhucka at caltech.edu --
http://www.cds.caltech.edu/~mhucka
California Institute of Technology