EPrints Technical Mailing List Archive

Message: #05933



Re: [EP-tech] Problem depositing larger documents via SWORD 2.0


Hi Willem,

I’m not using eprints_wrapper as such, but a similar homemade process in PHP that uses base64_encode and the PHP cURL library to push files to the SWORD 2.0 endpoint on EPrints.  I just tested with a 5MB zip file and the encoding and upload took about 4s.  I don’t know offhand the spec of the virtual server it is running on, but I think it has 2GB RAM and runs SUSE Linux.  Likewise I’m unsure of the spec at the EPrints end, but it’s also a VM.

 

However, it crashed on a 26MB file.  I tried again with 3 x 8MB files and it worked fine, taking about 10s.

 

Not sure if this helps, but it does suggest that base64 processing is not a problem in itself, time-wise, with average hardware at either end.  The only obvious difference I can spot is that mine uses chunk_split to break the base64 into lines.  I can’t remember how I arrived at that, but it might be worth a try; it works for me.

 

 

Andy

 

======================= Base64 encoding fragment ===========================

 

while ($f = mysql_fetch_array($files_result)) {  # build file metadata and base64 data
    $filenum++;
    $filename     = $f['file_oaManuscript'];
    $filenamesafe = htmlspecialchars($filename);  # Who puts ampersands in filenames!!
    $mimetype     = $f['file_oaManuscript_mimetype'];

    $maintype = $mimetype;
    $mainfile = $filenamesafe;
    if (FALSE === ($STUFF = file_get_contents($filebase.$filename))) {
        die("\n\nfailed to get file: $filebase$filename");
    }
    $base64        = chunk_split(base64_encode($STUFF));  # base64 in 76-character lines
    $hash          = md5($base64);
    $filesize      = strlen($STUFF);
    $file_modified = $f['modified_oaManuscript'];

    $filesXML = "
        <file>
            <datasetid>document</datasetid>
            <filename>$filenamesafe</filename>
            <mime_type>$mimetype</mime_type>
            <hash>$hash</hash>
            <hash_type>MD5</hash_type>
            <filesize>$filesize</filesize>
            <mtime>$file_modified</mtime>
            <data encoding='base64'>";

    $filesXML .= $base64;

    $filesXML .= "</data>
        </file>";
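
If the crash on the larger file turns out to be memory-related (an assumption; nothing in this thread confirms the cause), one thing to try is encoding the file in blocks rather than holding the raw file, the base64 string and the growing XML string in memory together. The following is only a minimal sketch: base64_file_to_stream is a hypothetical helper, not part of the script above, and it writes to a stream (for example a temporary file) instead of appending to $filesXML.

```php
<?php
// Hypothetical helper (not in the original script): writes the same bytes as
// chunk_split(base64_encode(file_get_contents($path))) to the stream $out,
// but reads and encodes the file one block at a time.
function base64_file_to_stream(string $path, $out, int $lines_per_read = 1024): int
{
    $in = fopen($path, 'rb');
    if ($in === false) {
        throw new RuntimeException("failed to open file: $path");
    }
    // 57 raw bytes encode to exactly one 76-character base64 line, so reading
    // a multiple of 57 bytes keeps the output identical to encoding in one go.
    $block   = 57 * $lines_per_read;
    $written = 0;
    while (!feof($in)) {
        $raw = stream_get_contents($in, $block);  // up to $block bytes, fewer at end of file
        if ($raw === false || $raw === '') {
            break;
        }
        $written += fwrite($out, chunk_split(base64_encode($raw)));
    }
    fclose($in);
    return $written;  // number of encoded bytes written
}
```

The XML before and after the data could then be written to the same stream, so the complete document never has to exist as a single PHP string.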

 

==========CURL FRAGMENT=========================================================================================================

 

 

curl_setopt($ch, CURLOPT_URL, "http://researchonline.lshtm.ac.uk/id/contents");
curl_setopt($ch, CURLOPT_HEADER, 1);

$pkgheader = Array('X-Packaging: http://eprints.org/ep2/data/2.0',
                   'Content-Type: text/xml',
                   'Metadata-Relevant: true',
                   'X-Verbose: true',
                   'In-Progress: false');  # TRUE => user inbox;  FALSE => review
curl_setopt($ch, CURLOPT_HTTPHEADER, $pkgheader);

$html_in = "http://pubdb.lshtm.ac.uk/publications/OAmgr/OAmgr_upload/eprints_xml.php?filter=oaPub_ID&value=$oaPub_ID";  # fetches eprints XML
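$data = file_get_contents($html_in);  # EPrints XML to POST; this line is garbled in the archive copy, so the call shown is inferred from context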
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

($result = curl_exec($ch)) || die("curl_exec failed: " . curl_error($ch));
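
Because CURLOPT_HEADER is switched on, $result holds the response headers followed by the body, and with a large POST cURL may also receive an interim "100 Continue" block first. A rough sketch of separating these so the status code can be checked is below; split_http_response is a hypothetical helper name, and the sketch assumes the headers use CRLF line endings.

```php
<?php
// Hypothetical helper: splits a raw cURL result (fetched with CURLOPT_HEADER
// enabled) into status code, headers and body.
function split_http_response(string $raw): array
{
    // Skip any interim "100 Continue" blocks that precede the real response.
    do {
        $parts = explode("\r\n\r\n", $raw, 2);
        $head  = $parts[0];
        $raw   = $parts[1] ?? '';
    } while (preg_match('#^HTTP/\S+ 100\b#', $head) && $raw !== '');

    $lines  = explode("\r\n", $head);
    $status = 0;
    // The first line is the status line, e.g. "HTTP/1.1 201 Created".
    if (preg_match('#^HTTP/\S+ (\d{3})#', array_shift($lines), $m)) {
        $status = (int) $m[1];
    }
    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') !== false) {
            [$name, $value] = explode(':', $line, 2);
            $headers[strtolower(trim($name))] = trim($value);  // header names lower-cased
        }
    }
    return ['status' => $status, 'headers' => $headers, 'body' => $raw];
}
```

Called as split_http_response($result), this makes it possible to check for a 2xx status before treating the deposit as successful, rather than relying only on curl_exec not failing.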

 

 

 

 

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of John Salter
Sent: 15 September 2016 11:25
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Problem depositing larger documents via SWORD 2.0

 

Hi Willem,

I’ve had a quick look at the php code.

It’s base64 encoding the file, and adding it to the EPrintsXML it generates in a <document> element.

 

The encoding (and decoding at the other end) takes some time – and is probably not the correct process for larger files.
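
To put a rough figure on the size overhead of this approach, the sketch below works out how large a file becomes once it is base64 encoded and split into 76-character lines, as chunk_split does by default. base64_chunked_size is a hypothetical helper used only for this illustration, and it assumes a non-empty file.

```php
<?php
// Hypothetical helper: size in bytes of chunk_split(base64_encode($data))
// for $n raw bytes ($n > 0), using 76-character lines that end in "\r\n".
function base64_chunked_size(int $n): int
{
    $encoded = 4 * intdiv($n + 2, 3);      // 4 output characters per 3 input bytes, padded
    $lines   = intdiv($encoded + 75, 76);  // number of 76-character lines, rounded up
    return $encoded + 2 * $lines;          // every line gains a "\r\n"
}

$raw = 26 * 1024 * 1024;  // the 26 MB case discussed in this thread
printf("%d raw bytes become %d encoded bytes\n", $raw, base64_chunked_size($raw));
```

For a 26 MB file this comes to roughly 37% more data to send and then decode on the server, before counting the surrounding XML.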

 

This is the process that I think *should* be used in this scenario:

http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_creatingresource_multipart

but I’m not sure if the EPrintsWrapper class can do this…

 

Others on this list have more SWORD experience than me – hopefully someone will be able to provide a bit more advice.

 

Cheers,

John

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of W. Struiksma
Sent: 14 September 2016 14:13
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Problem depositing larger documents via SWORD 2.0

 

Hi all,

 

I'm currently having problems depositing larger documents (> 5 MB) via SWORD 2.0. I'm using a PHP script that uses EPrintsWrapper.php. In this script the EPrints XML (including document) is posted via cURL.

 

 

The deposit takes a very long time (8 minutes for 26 MB) and the Apache process goes to 100% CPU usage.

 

Has anyone experienced the same behaviour before? What can I do about it?

 

We use EPrints 3.3.13.

Thanks in advance!

Sincerely,
Willem Struiksma
University of Groningen