EPrints Technical Mailing List Archive

Message: #06384

Re: [EP-tech] Scripted XML download?

To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Scripted XML download?
From: Adam Field <af05v@ecs.soton.ac.uk>
Date: Mon, 27 Mar 2017 22:39:54 +0100

Have you tried a command-line export? Even if it takes a while, as long as it doesn't consume too many system resources your repository will stay nice and snappy. You could, for example, trigger it to run at 1am and write the export to a location in your html directory, then wget it a day later (just in case it runs longer). You could speed up the wget by gzipping the file.

The command would be:

<eprints_root>/bin/export <archive_id> archive XML | gzip > <eprints_root>/archives/<archive_id>/html/en/eprint_archive.xml.gzip

The wget would be (note the -O - so that the download goes to stdout rather than to a file):

wget -q -O - <base_url>/eprint_archive.xml.gzip | gunzip > eprint_archive.xml
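Since a large download can fail partway through, it may be worth checking the compressed file before replacing the copy you already have. The following is only a sketch under assumptions: the function name unpack_if_valid and the BASE_URL variable are made up for illustration and are not part of EPrints or wget. It saves the download first, tests it with gzip -t, and only then overwrites the local XML.

```shell
#!/bin/sh
# unpack_if_valid GZ_FILE DEST_XML   (hypothetical helper, not an EPrints tool)
# Replace DEST_XML only if GZ_FILE is a complete, valid gzip stream.
unpack_if_valid() {
    gz=$1
    dest=$2
    tmp="${dest}.tmp.$$"
    # gzip -t verifies the stream, so a truncated download is rejected.
    # Reading from stdin avoids any dependence on the file's suffix.
    if gzip -t < "$gz" 2>/dev/null && gzip -dc < "$gz" > "$tmp"; then
        # Rename only after a full, successful decompression.
        mv -f "$tmp" "$dest"
    else
        rm -f "$tmp"
        echo "download incomplete or corrupt: $gz" >&2
        return 1
    fi
}

# Possible usage (BASE_URL is a placeholder you would set yourself):
#   wget -q -O eprint_archive.xml.gzip "$BASE_URL/eprint_archive.xml.gzip" \
#     && unpack_if_valid eprint_archive.xml.gzip eprint_archive.xml
```

If the check fails, the previous eprint_archive.xml is left untouched, so whatever ingests it still sees the last good copy.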

Note that there shouldn't be any security issues, because the archive dataset contains only the live items, so it should all be publicly visible anyway. Also, be careful that you aren't downloading it while you're regenerating it.
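One way to reduce the risk of fetching a half-written file is to build the export under a temporary name and rename it into place only once it is complete. This is a rough sketch rather than a tested recipe: publish_atomically is a hypothetical helper, and it assumes the temporary files live in the same directory as the destination so that mv is a simple rename.

```shell
#!/bin/sh
# publish_atomically DEST COMMAND [ARGS...]   (hypothetical helper)
# Runs COMMAND, gzips its stdout, and moves the result to DEST only on success.
publish_atomically() {
    dest=$1
    shift
    tmp_xml="${dest}.raw.$$"
    tmp_gz="${dest}.tmp.$$"
    # Two steps instead of a pipe, so a failing export command is noticed
    # (in plain sh a pipeline only reports the status of its last command).
    if "$@" > "$tmp_xml" && gzip -c "$tmp_xml" > "$tmp_gz"; then
        rm -f "$tmp_xml"
        # A rename within one filesystem swaps the file in a single step.
        mv -f "$tmp_gz" "$dest"
    else
        rm -f "$tmp_xml" "$tmp_gz"
        return 1
    fi
}

# Possible usage, with the same placeholders as the command above:
#   publish_atomically "<eprints_root>/archives/<archive_id>/html/en/eprint_archive.xml.gzip" \
#       "<eprints_root>/bin/export" "<archive_id>" archive XML
```

The trade-off is that the uncompressed export briefly sits on disk next to the destination, so the html directory needs enough free space for it.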

Lastly, the above was typed directly into the email -- your mileage may vary both with syntax and conceptual errors.

--
Adam Field

On 27 Mar 2017, at 14:51, Andy Reid <Andy.Reid@lshtm.ac.uk> wrote:

Hi,
I do some checking, analysis and visualisation of our repository in a third-party package, and I have it set up to ingest EPrints XML. I’d like to update this once a week or so, but if I download it all in one big go it takes about 3 hours, comes to 1.5GB, and tends to fail halfway through. I have been doing it manually one year at a time, but that means 17 separate manual search-and-download operations, each taking ten minutes or so. I don’t have shell access to the server, so can’t script it from the command line.

I have looked at the search page but after a search, the download form references a cached search id so I can’t just copy the URL in the download form.

Can anyone give me a template for a URL that would work in a single pass in wget or libwww, that I could then cron to fetch the EPXML ? Obviously I have to be able to authenticate as well… ?

Andy Reid
Research Information Manager
Executive Office, Room G40a
London School of Hygiene and Tropical Medicine
Keppel St, LONDON, WC1E 7HT
0207-927-2618 (Internal/Teleworker x2618)
