EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09663



RE: [EP-tech] Sword 2.0 API upload times


CAUTION: This e-mail originated outside the University of Southampton.

Ah, I can see how that might be less efficient than, e.g., posting the document itself.

 

I had a quick trace through the code – it hits some SAX parsing, but I suspect the bottleneck might be in EPrints::DataObj::File – and things doing base64_decode.


You could export some existing EPrints with various file sizes, using the ‘XMLFiles’ plugin:

~/bin/export ARCHIVEID eprint XMLFiles EPRINTID > /somewhere/EPRINTID.xml

 

Import these on your test system via the command line, and see whether the import time grows out of proportion to the file size.
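
One way to run that comparison is to time each import and set the elapsed time against the size of the exported XML. The Python sketch below only wraps a command in a timer. The bin/import argument list in build_import_cmd (dataset "eprint", import plugin "XML") and all paths are assumptions, not taken from this thread, so check them against bin/import --help on your installation first.

```python
import os
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def build_import_cmd(import_bin, archive_id, xml_path):
    """Assumed bin/import invocation; verify the arguments locally."""
    return [import_bin, archive_id, "eprint", "XML", xml_path]


def time_imports(import_bin, archive_id, xml_paths):
    """Return (path, size in MB, seconds) for each exported XML file."""
    results = []
    for path in xml_paths:
        size_mb = os.path.getsize(path) / 1e6
        seconds = time_command(build_import_cmd(import_bin, archive_id, path))
        results.append((path, size_mb, seconds))
    return results


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Usage: script.py /path/to/bin/import ARCHIVEID file1.xml [file2.xml ...]
    import_bin, archive_id, *paths = sys.argv[1:]
    for path, size_mb, seconds in time_imports(import_bin, archive_id, paths):
        print(f"{path}: {size_mb:.1f} MB in {seconds:.1f} s "
              f"({seconds / size_mb:.2f} s per MB)")
```

If the seconds-per-MB figure climbs with file size on the command line as well, the cost is more likely in the XML import and decoding than in the SWORD layer.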

 

Cheers,
John

 

From: Martin Brändle <martin.braendle@uzh.ch>
Sent: Wednesday, March 6, 2024 7:03 AM
To: John Salter <J.Salter@leeds.ac.uk>; eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Sword 2.0 API upload times

 

CAUTION: External Message. Use caution opening links and attachments.

Yes,

A single EP3XML with base64 encoded files is sent.

It might well be that the decoding algorithm isn’t efficient.

I’ll check with the sender if he can change the way the files are uploaded.

Kind regards,

Martin

 

 

From: John Salter <J.Salter@leeds.ac.uk>
Date: Tuesday, 5 March 2024 at 20:23
To: Martin Brändle <martin.braendle@uzh.ch>, eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times

Just checking, in case we've overlooked something...

In your original report, you said the upload was EPXML, with embedded files.

Do you mean that the files are base64 encoded into the XML payload?

 

I don't think any of the testing people have done so far has covered this exact case.

The XML parser might need to cache the entire payload before decoding the base64 document.

I can see that this might have bottlenecks in the XML creation (at the sender end) and XML parsing (EPrints end).
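
For a sense of scale: base64 turns every 3 bytes into 4 characters, so an embedded file is about a third larger inside the XML than on disk, and a parser that buffers the whole text node holds all of it before decoding. The Python sketch below shows the size calculation and, for contrast, a decode loop that works chunk by chunk. It is an illustration only, not how EPrints decodes; the function names are made up here, and it assumes the base64 text contains no line breaks or other whitespace.

```python
import base64
import io


def encoded_size(raw_bytes):
    """Length of the padded base64 text for raw_bytes bytes of input."""
    return 4 * ((raw_bytes + 2) // 3)


def decode_in_chunks(src, dst, chunk_chars=4 * 16384):
    """Decode base64 from binary stream src into dst without loading it all.

    chunk_chars must be a multiple of 4 so each chunk decodes on its own.
    Assumes src.read() returns full chunks until the end (true for regular
    files and BytesIO) and that the input has no embedded whitespace.
    """
    total = 0
    while True:
        chunk = src.read(chunk_chars)
        if not chunk:
            break
        data = base64.b64decode(chunk)
        dst.write(data)
        total += len(data)
    return total


if __name__ == "__main__":
    payload = bytes(range(256)) * 1000
    encoded = base64.b64encode(payload)
    out = io.BytesIO()
    decoded_bytes = decode_in_chunks(io.BytesIO(encoded), out)
    print(len(payload), len(encoded), decoded_bytes)
```

Decoding done this way is linear in the payload size, so if the real upload time grows much faster than that, the decoding step alone is unlikely to be the whole explanation and buffering or repeated string copying would be worth looking at.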

 


From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Martin Brändle <martin.braendle@uzh.ch>
Sent: 05 March 2024 13:12
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times

 

Hi David and John,

 

Thanks for testing and advice. I tested with a 607 MB mp4 file (so no unpacking required). Upload took between 1:26 and 2:06 minutes irrespective of whether I had virus checking and DROID format detection enabled or not.

For a 103MB file it took around 10-11 seconds. Reasonable on a shared WiFi network with a Tx rate of 573Mbps.

 

So there may be many factors affecting upload rate. I think I have to check with the complainer 😊.

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Yuri <yurj@alfa.it>
Date: Monday, 4 March 2024 at 09:24
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times

Maybe PDF indexing or the PDF cover could be the problem? But that processing should be asynchronous via the queue, right?

On 01/03/24 17:29, David R Newman wrote:

Hi Martin,

I just tried uploading a 100MB file using the CRUD API and this seemed to take only a few seconds (to my dev VM running EPrints 3.4 GitHub HEAD):

time curl -X POST -i -u USERNAME:PASSWORD --data-binary "@100MB.txt" -H 'Content-Disposition: attachment; filename="100MB.txt"' -H "Content-Type: text/plain" https://eprints.example.org/id/eprint/1234/contents
real    0m6.119s
user    0m0.279s
sys     0m0.278s


I confirmed that the file had uploaded successfully and downloaded it to confirm it was of the expected size.

I am not sure whether the SWORD API does anything beyond what this curl request does. Is the uploaded file a zip that needs to be unpacked?

Regards

David Newman

On 01/03/2024 3:48 pm, Martin Brändle wrote:

Dear all,

 

Since very recently, one faculty of our university has been depositing its dissertations via the SWORD 2.0 API. The EP3XML is deposited with the PDF embedded.

 

Everything works fine; however, the faculty observes that the time until they get feedback that processing has finished grows disproportionately with the size of the PDF:

 

  • 3.8 MB: 7 s
  • 16.5 MB: 2 min 30 s
  • 22.6 MB: 4 min
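
Expressed as throughput, the figures above make the slowdown easier to see. This is a small Python calculation on the three reported values only, taking the sizes and times exactly as stated; the variable and function names are chosen here for illustration.

```python
# (size in MB, elapsed seconds) as reported in the list above
timings = [(3.8, 7), (16.5, 150), (22.6, 240)]


def throughput_mb_per_s(size_mb, seconds):
    """Effective upload-plus-processing rate in MB per second."""
    return size_mb / seconds


if __name__ == "__main__":
    for size_mb, seconds in timings:
        rate = throughput_mb_per_s(size_mb, seconds)
        print(f"{size_mb:5.1f} MB in {seconds:3d} s: "
              f"{rate:.3f} MB/s, {seconds / size_mb:.1f} s per MB")
```

The rates come out at roughly 0.54, 0.11 and 0.09 MB/s, so the effective rate drops about five- to six-fold between the smallest file and the two larger ones. A purely linear cost would keep the rate roughly constant.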

 

Is this behaviour known to you? Are there any settings we could adjust?

 

We do some checks such as scanning for viruses or format determination using Droid. The former is done immediately in the document_validate.pl, the latter is being triggered after the document has been uploaded. So I don’t see any bottleneck in these processes.

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

*** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
*** Archive: https://www.eprints.org/tech.php/
*** EPrints community wiki: https://wiki.eprints.org/
 

 
