EPrints Technical Mailing List Archive

Message: #06384



Re: [EP-tech] Scripted XML download?


Have you tried a command-line export?  Even if it takes a while, as long as it doesn't consume too many system resources your repository will stay nice and snappy.  You could, for example, trigger it to run at 1am and write the export to a location in your html directory, then wget it a day later (just in case it runs long).  You could speed up the download by gzipping the export.

the command would be:

<eprints_root>/bin/export <repositoryid> archive XML | gzip > <eprints_root>/archives/<repositoryid>/html/en/eprint_archive.xml.gzip

wget would be:

wget -q -O - <base_url>/eprint_archive.xml.gzip | gunzip > eprint_archive.xml
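Since the original download tended to fail halfway through, the fetching side could check the file before using it. The following is only a rough sketch, not something tested against a real repository: the function name fetch_export and the temporary file names are made up for illustration, and the URL is the same placeholder as above. It downloads to a temporary file, uses gzip -t to reject a truncated transfer, and only then replaces the previous copy.

```shell
#!/bin/sh
# Sketch only: fetch_export and the temporary file names are hypothetical.

# fetch_export URL DEST
# Download a gzipped export, verify it, then unpack it to DEST.
fetch_export() {
    url=$1
    dest=$2
    tmp="$dest.download"

    # -q: quiet, -O FILE: write the response body to FILE.
    wget -q -O "$tmp" "$url" || { rm -f "$tmp"; return 1; }

    # gzip -t tests integrity; a truncated download fails here.
    gzip -t "$tmp" || { rm -f "$tmp"; return 1; }

    # Unpack under a temporary name, then rename, so DEST is never
    # left half-written and the previous copy survives a failure.
    if gunzip -c "$tmp" > "$dest.tmp"; then
        rm -f "$tmp"
        mv "$dest.tmp" "$dest"
    else
        rm -f "$tmp" "$dest.tmp"
        return 1
    fi
}

# Example call (placeholder URL, fill in your own):
# fetch_export "<base_url>/eprint_archive.xml.gzip" eprint_archive.xml
```

A non-zero return status means the old eprint_archive.xml is still in place, so a weekly cron job can simply retry the next day.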


Note that there shouldn't be any security issues, because the archive dataset contains only the live items, so it should all be publicly visible anyway.  Also, be careful that you aren't downloading it at the time you're regenerating it.
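One way to avoid serving a half-written file while the export is regenerating is to write to a temporary name in the same directory and rename it into place at the end, since a rename within one filesystem is atomic. This is a minimal sketch under that assumption: the function name publish_gzipped, the temporary file naming, and the script path in the cron line are hypothetical, and the angle-bracket placeholders are the same ones used in the command above.

```shell
#!/bin/sh
# Sketch only: publish_gzipped and the temp-file naming are hypothetical.

# publish_gzipped DEST CMD [ARGS...]
# Run CMD, gzip its output, and move the result to DEST in one step.
publish_gzipped() {
    dest=$1
    shift
    tmp="$dest.tmp.$$"

    # Write the raw output first so a failing CMD is detected (a plain
    # "CMD | gzip" pipeline would report only gzip's exit status).
    # This needs room for the uncompressed export on disk for a while.
    if "$@" > "$tmp" && gzip -f "$tmp"; then
        # gzip replaced $tmp with $tmp.gz; rename it over the old file.
        mv "$tmp.gz" "$dest"
    else
        rm -f "$tmp" "$tmp.gz"
        return 1
    fi
}

# Example call (placeholders as in the command above):
# publish_gzipped "<eprints_root>/archives/<repositoryid>/html/en/eprint_archive.xml.gzip" \
#     "<eprints_root>/bin/export" "<repositoryid>" archive XML

# Example crontab entry to run a script containing that call at 1am
# (the script path is made up):
# 0 1 * * * /path/to/nightly_export.sh
```

If the export fails, the previous file is left untouched, so a download always gets either the old complete export or the new complete one.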

Lastly, the above was typed directly into the email -- your mileage may vary both with syntax and conceptual errors.


--
Adam Field

On 27 Mar 2017, at 14:51, Andy Reid <Andy.Reid@lshtm.ac.uk> wrote:

Hi,

I do some checking, analysis and visualisation of our repository in a third-party package, and I have it set up to ingest EPrints XML.  I’d like to update this once a week or so, but if I download it all in one big go it takes about 3 hours, comes to 1.5GB, and tends to fail halfway through.  I have been doing it manually one year at a time, but that means 17 separate manual search-and-download operations, each taking ten minutes or so.  I don’t have shell access to the server, so I can’t script it from the command line.

 

I have looked at the search page but after a search, the download form references a cached search id so I can’t just copy the URL in the download form. 

 

Can anyone give me a template for a URL that would work in a single pass in wget or libwww, which I could then cron to fetch the EPXML?  Obviously I would have to be able to authenticate as well.

 

Andy Reid

Research Information Manager

Executive Office, Room G40a

London School of Hygiene and Tropical Medicine

Keppel St, LONDON, WC1E 7HT

0207-927-2618 (Internal/Teleworker x2618)

 

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/