[EP-tech] Problem depositing larger documents via SWORD 2.0
Andy Reid
Andy.REID at lshtm.ac.uk
Thu Sep 15 13:51:33 BST 2016
Hi Willem,
I’m not using eprints_wrapper as such, but a similar homemade process in PHP, using base64_encode and the PHP cURL library to push files to the SWORD 2.0 portal on EPrints. I just tested with a 5 MB zip file and the encoding and upload took about 4 s. I don’t know offhand the spec of the virtual server it is running on, but I think it has 2 GB RAM and runs SUSE Linux. Likewise I’m unsure of the spec at the EPrints end, but it’s also a VM.
However, it crashed on a 26 MB file. I tried again with 3 x 8 MB files and it worked fine, in about 10 s.
Not sure if this helps, but it does suggest that base64 processing is not a problem in itself, time-wise, with average hardware at either end. The only obvious difference I can spot is that mine uses chunk_split to break up the base64 into lines, but how I arrived at that I can’t remember. Might be worth a try, works for me.
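If the crash on the larger file is down to memory (a guess, nothing measured here), one option is to encode the file in pieces rather than holding the raw data and the base64 copy in memory at once. A minimal sketch, assuming PHP 7 or later; the function name base64EncodeStream and its $linesPerRead parameter are made up for illustration. It reads a multiple of 57 bytes at a time, because 57 raw bytes become exactly one 76-character base64 line, so the output should be identical to chunk_split(base64_encode(...)) on the whole file.

```php
<?php
declare(strict_types=1);

/**
 * Base64-encode the stream $in into the stream $out, in 76-character
 * lines terminated by "\r\n" (the chunk_split() defaults).
 *
 * Hypothetical helper for illustration; not part of EPrints or of
 * EPrintsWrapper.php. Returns the number of bytes written to $out.
 */
function base64EncodeStream($in, $out, int $linesPerRead = 1024): int
{
    // 57 raw bytes encode to exactly 76 base64 characters with no
    // padding, so every full read ends on a line boundary.
    $bytesPerRead = 57 * $linesPerRead;
    $written = 0;

    while (!feof($in)) {
        // stream_get_contents() keeps reading until it has the requested
        // length or reaches end of file.
        $raw = stream_get_contents($in, $bytesPerRead);
        if ($raw === false || $raw === '') {
            break;
        }
        $n = fwrite($out, chunk_split(base64_encode($raw)));
        if ($n === false) {
            throw new RuntimeException('failed to write base64 output');
        }
        $written += $n;
    }

    return $written;
}
```

The two arguments would be handles from fopen(), for example the source file opened with 'rb' and a temporary file opened with 'wb' that already contains the opening part of the XML. Whether this helps depends on where the memory actually goes, since the finished XML still has to be posted.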
Andy
======================= Base64 encoding fragment ===========================
while ($f = mysql_fetch_array($files_result)) { # build file metadata and base64 data
    $filenum++;
    $filename = $f['file_oaManuscript'];
    $filenamesafe = htmlspecialchars($filename); # Who puts ampersands in filenames!!
    $mimetype = $f['file_oaManuscript_mimetype'];
    $maintype = $mimetype;
    $mainfile = $filenamesafe;
    if (FALSE === ($STUFF = file_get_contents($filebase.$filename))) {
        die("\n\nfailed to get file: $filebase$filename");
    }
    $base64 = chunk_split(base64_encode($STUFF)); # 76-character lines ending in \r\n
    $hash = md5($base64); # MD5 of the base64 text, not of the raw file
    $filesize = strlen($STUFF); # size of the raw file in bytes
    $file_modified = $f['modified_oaManuscript'];
    $filesXML = "
<file>
<datasetid>document</datasetid>
<filename>$filenamesafe</filename>
<mime_type>$mimetype</mime_type>
<hash>$hash</hash>
<hash_type>MD5</hash_type>
<filesize>$filesize</filesize>
<mtime>$file_modified</mtime>
<data encoding='base64'>";
    $filesXML .= $base64;
    $filesXML .= "</data>
</file>";
======================= cURL fragment ===========================
# $ch is assumed to have been created earlier with curl_init()
curl_setopt($ch, CURLOPT_URL, "http://researchonline.lshtm.ac.uk/id/contents");
curl_setopt($ch, CURLOPT_HEADER, 1); # include the response headers in the returned output
$pkgheader = array(
    'X-Packaging: http://eprints.org/ep2/data/2.0',
    'Content-Type: text/xml',
    'Metadata-Relevant: true',
    'X-Verbose: true',
    'In-Progress: false', # TRUE => user inbox; FALSE => review
);
curl_setopt($ch, CURLOPT_HTTPHEADER, $pkgheader);
$html_in = "http://pubdb.lshtm.ac.uk/publications/OAmgr/OAmgr_upload/eprints_xml.php?filter=oaPub_ID&value=$oaPub_ID"; # fetches eprints XML
$data = file_get_contents($html_in);
if ($data === FALSE) { die("failed to fetch eprints XML from $html_in"); }
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$result = curl_exec($ch);
# compare against FALSE explicitly, so an empty response body is not treated as a failure
if ($result === FALSE) { die("curl_exec failed: " . curl_error($ch)); }
From: eprints-tech-bounces at ecs.soton.ac.uk On Behalf Of John Salter
Sent: 15 September 2016 11:25
To: eprints-tech at ecs.soton.ac.uk
Subject: Re: [EP-tech] Problem depositing larger documents via SWORD 2.0
Hi Willem,
I’ve had a quick look at the php code.
It’s base64 encoding the file, and adding it to the EPrintsXML it generates in a <document> element.
The encoding (and decoding at the other end) takes some time – and is probably not the correct process for larger files.
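For a sense of the size overhead involved: base64 turns every 3 raw bytes into 4 characters, and chunk_split with its defaults adds a 2-byte line ending per 76-character line. A small sketch of that arithmetic, assuming PHP 7 or later; the function names are made up for illustration, and "26 MB" is taken to mean 26 x 1024 x 1024 bytes.

```php
<?php
declare(strict_types=1);

// Length of base64_encode() output for $rawBytes bytes of input:
// every group of 3 bytes (the last one padded) becomes 4 characters.
function base64Length(int $rawBytes): int
{
    return 4 * intdiv($rawBytes + 2, 3);
}

// Length after chunk_split() with its defaults: one 2-byte "\r\n"
// is appended to every line of up to 76 characters.
function chunkSplitLength(int $len, int $lineLen = 76, int $eolLen = 2): int
{
    return $len + $eolLen * intdiv($len + $lineLen - 1, $lineLen);
}

$raw = 26 * 1024 * 1024; // assumed byte count for a "26 MB" file
$b64 = base64Length($raw);
printf("raw: %d bytes\n", $raw);
printf("base64: %d bytes\n", $b64);
printf("base64 in 76-character lines: %d bytes\n", chunkSplitLength($b64));
```

So the encoded document is roughly a third larger than the file itself, before counting any copies made while building or parsing the XML.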
This is the process that I think *should* be used in this scenario:
http://swordapp.github.io/SWORDv2-Profile/SWORDProfile.html#protocoloperations_creatingresource_multipart
but I’m not sure if the EPrintsWrapper class can do this…
Others on this list have more SWORD experience than me – hopefully someone will be able to provide a bit more advice.
Cheers,
John
From: eprints-tech-bounces at ecs.soton.ac.uk On Behalf Of W. Struiksma
Sent: 14 September 2016 14:13
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Problem depositing larger documents via SWORD 2.0
Hi all,
I'm currently having problems depositing larger documents (> 5 MB) via SWORD 2.0. I'm using a PHP script that uses EPrintsWrapper.php. In this script the EPrints XML (including document) is posted via cURL.
https://github.com/davidfkane/eprintsDepositHelper/blob/master/EPrintsWrapper.php
The deposit takes a very long time (8 minutes for 26 MB) and the Apache process goes to 100% CPU usage.
Has anyone experienced the same behaviour before? What can I do about it?
We use EPrints 3.3.13.
Thanks in advance!
Sincerely,
Willem Struiksma
University of Groningen