EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09699

Re: [EP-tech] Sword 2.0 API upload times


CAUTION: This e-mail originated outside the University of Southampton.

Hi,

 

Problem solved: after getting some files from the faculty, we were able to test them and found the culprit in their XML. The data element for the base64-encoded file was missing the encoding="base64" attribute. After adding it, the upload of a 30 MB file took 9 seconds instead of 90 seconds.

 

It seems like the algorithm for decoding goes into some “guess mode” if the encoding is not specified.
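
For illustration, here is a minimal sketch of a payload that carries the attribute. The element layout is an assumption based on the EPrints XML export format, make_epxml is a hypothetical helper name, and a real deposit would carry more metadata than this:

```shell
#!/bin/sh
# make_epxml FILE: print a minimal EPrints XML document with FILE embedded
# as base64. Hypothetical helper; the element layout is an assumption, and
# FILE's name is assumed to contain no XML special characters (&, <, >).
make_epxml() {
    file=$1
    name=$(basename "$file")
    printf '%s\n' '<?xml version="1.0" encoding="utf-8"?>'
    printf '%s\n' '<eprints xmlns="http://eprints.org/ep2/data/2.0">'
    printf '%s\n' '  <eprint>' '    <documents>' '      <document>'
    printf '        <main>%s</main>\n' "$name"
    printf '%s\n' '        <files>' '          <file>'
    printf '            <filename>%s</filename>\n' "$name"
    # The attribute whose absence caused the slow uploads in this thread.
    printf '            <data encoding="base64">'
    # Strip line breaks so the encoded data sits on a single line.
    base64 < "$file" | tr -d '\n'
    printf '</data>\n'
    printf '%s\n' '          </file>' '        </files>'
    printf '%s\n' '      </document>' '    </documents>' '  </eprint>' '</eprints>'
}
```

Something like `make_epxml thesis.pdf > deposit.xml` would then produce a file to post. The content type I would expect for such a deposit is application/vnd.eprints.data+xml, but please check that against your own repository before relying on it.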

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

mail: martin.braendle@uzh.ch
phone: +41 44 63 56705
https://orcid.org/0000-0002-7752-6567
https://www.zi.uzh.ch

 

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Martin Brändle <martin.braendle@uzh.ch>
Date: Wednesday, 6 March 2024 at 10:49
To: David R Newman <drn@ecs.soton.ac.uk>, eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>, John Salter <J.Salter@leeds.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times


Hi David,

 

Thanks for investigating. We have enough memory on our production servers (32 GB each), although currently they are under high load (300-400 login tickets in peak times).

The question is whether there is a memory limit for the Apache process or for the XML parsing. Can this be configured somewhere?

As far as I remember, there shouldn’t be any. E.g. the nightly create views processes on our compute server consume up to 18 GB without any hassle.

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

From: David R Newman <drn@ecs.soton.ac.uk>
Date: Wednesday, 6 March 2024 at 09:34
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>, Martin Brändle <martin.braendle@uzh.ch>, John Salter <J.Salter@leeds.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times

Hi John and Martin,

I added the base64 encoding of my 100MB test file to an EPrints XML file and uploaded it as a new eprint. This did take longer than uploading the raw file as an addition to an existing file:

== base64 encoded in XML upload ==
0m16.550s
0m18.164s
0m15.435s

== raw file upload ==
 0m9.969s
 0m8.706s
 0m8.261s

Whilst testing, I found that the former's times in particular, and even the latter's, can be significantly affected by how busy the server is or, probably more likely, by the amount of RAM available. With a 10MB file there was little difference: 2 seconds vs 1.5 seconds. With a 500MB file, however, there was quite a big difference: 1m16s vs 33.5s.

Looking at all these times, they are still not the lengths of time you described in your original email. I should add the caveat that all these tests were done by running curl commands on the same server to which I was uploading. This probably helps negate any network effects but may have given slightly more optimistic times than you would get in a real-world scenario. That said, the important point is that the difference between raw upload and base64-encoded XML upload does not diverge exponentially as file size increases, or at least not excessively so. Admittedly, this excludes the time it takes to base64 encode the file in the first place. For information, this took around 6 seconds for the 500MB file on a low-spec (1 CPU core / 2GB) server.
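
One fixed part of the gap between the two upload styles is simply payload size: base64 turns every 3 input bytes into 4 output characters, so the embedded file is about a third larger before any parsing happens. A small sketch of that arithmetic follows (b64_size is a hypothetical helper name, and treating "100MB" as 100 MiB is an assumption):

```shell
#!/bin/sh
# b64_size N: size in bytes of the base64 encoding (without line breaks)
# of N input bytes. Each group of up to 3 input bytes becomes 4 characters,
# with the last group padded.
b64_size() {
    echo $(( ($1 + 2) / 3 * 4 ))
}

b64_size 104857600   # 100 MiB of raw data; prints 139810136
```

This only accounts for the extra bytes on the wire and in the parser's input. It does not by itself explain the roughly twofold difference in the timings above.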

Regards

David Newman

On 06/03/2024 7:02 am, Martin Brändle wrote:


Yes,

A single EP3XML with base64 encoded files is sent.

It might well be that the decoding algorithm isn’t efficient.

I’ll check with the sender if he can change the way the files are uploaded.

Kind regards,

Martin

 

 

From: John Salter <J.Salter@leeds.ac.uk>
Date: Tuesday, 5 March 2024 at 20:23
To: Martin Brändle <martin.braendle@uzh.ch>, eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times

Just checking, in case we've overlooked something...

In your original report, you said the upload was EPXML, with embedded files.

Do you mean that the files are base64 encoded into the XML payload?

 

I don't think any of the testing people have done so far has covered this exact case.

The XML parser might need to cache the entire payload before decoding the base64 document.

I can see that this might have bottlenecks in the XML creation (at the sender end) and XML parsing (EPrints end).

 


From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Martin Brändle <martin.braendle@uzh.ch>
Sent: 05 March 2024 13:12
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times

 

CAUTION: External Message. Use caution opening links and attachments.


Hi David and John,

 

Thanks for testing and advice. I tested with a 607 MB mp4 file (so no unpacking required). Upload took between 1:26 and 2:06 minutes irrespective of whether I had virus checking and DROID format detection enabled or not.

For a 103MB file it took around 10-11 seconds. Reasonable on a shared WiFi network with a Tx rate of 573Mbps.

 

So there may be many factors affecting upload rate. I think I have to check with the complainer 😊.

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Yuri <yurj@alfa.it>
Date: Monday, 4 March 2024 at 09:24
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: Re: [EP-tech] Sword 2.0 API upload times


Maybe PDF indexing or PDF cover page generation is the problem? But that processing should be asynchronous via the queue, right?

On 01/03/24 17:29, David R Newman wrote:

Hi Martin,

I just tried uploading a 100MB file using the CRUD API and this seemed to take only a few seconds (to my dev VM running EPrints 3.4 GitHub HEAD):

time curl -X POST -i -u USERNAME:PASSWORD --data-binary "@100MB.txt" -H 'Content-Disposition: attachment; filename="100MB.txt"' -H "Content-Type: text/plain" https://eprints.example.org/id/eprint/1234/contents
real    0m6.119s
user    0m0.279s
sys     0m0.278s


I confirmed that the file had uploaded successfully and downloaded it to confirm it was of the expected size.
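
The size check can be scripted along these lines. This is only a sketch: same_size is a hypothetical helper name, and the download itself is left out because the document URL depends on the repository.

```shell
#!/bin/sh
# same_size A B: succeed if files A and B have the same size in bytes.
# Returns 2 if either file cannot be read.
same_size() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    a=$(wc -c < "$1" | tr -d ' ')
    b=$(wc -c < "$2" | tr -d ' ')
    [ "$a" -eq "$b" ]
}

# Example use, after downloading the uploaded file to downloaded.txt:
#   same_size 100MB.txt downloaded.txt && echo "size matches"
```

Where a byte-for-byte comparison is wanted rather than a size check, `cmp -s 100MB.txt downloaded.txt` does that.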

I am not sure whether the SWORD API does anything beyond what is in this curl request. Is the uploaded file a zip that needs to be unpacked?

Regards

David Newman

On 01/03/2024 3:48 pm, Martin Brändle wrote:


Dear all,

 

Since very recently, one faculty of our university has been depositing its dissertations via the Sword 2.0 API. EP3XML with an embedded PDF is deposited.

 

Everything works fine; however, the faculty observes that the larger the PDF is, the disproportionately longer it takes until they get feedback that the process has finished:

 

  • 3.8 MB: 7 seconds
  • 16.5MB: 2min 30s
  • 22.6MB: 4min
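
To make the disproportion explicit, the figures above can be converted to seconds per MB. This is a small sketch using only the three data points reported (seconds_per_mb is a hypothetical helper name):

```shell
#!/bin/sh
# seconds_per_mb: read "<size in MB> <duration in seconds>" lines on stdin
# and print the time taken per MB for each.
seconds_per_mb() {
    awk '{ printf "%.1f MB: %.2f s/MB\n", $1, $2 / $1 }'
}

seconds_per_mb <<'EOF'
3.8 7
16.5 150
22.6 240
EOF
# prints:
#   3.8 MB: 1.84 s/MB
#   16.5 MB: 9.09 s/MB
#   22.6 MB: 10.62 s/MB
```

If the processing time grew linearly with file size, the seconds-per-MB figure would stay roughly constant. Here it rises about fivefold between the 3.8 MB and the 16.5 MB file.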

 

Is such behaviour known to you? Are there any settings we could adjust?

 

We do some checks such as scanning for viruses and format determination using DROID. The former is done immediately in document_validate.pl; the latter is triggered after the document has been uploaded. So I don’t see any bottleneck in these processes.

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

*** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
*** Archive: https://www.eprints.org/tech.php/
*** EPrints community wiki: https://wiki.eprints.org/
 

 
