[EP-tech] Re: Memory usage in 3.2, Sword 1.3 and epdata packages
mark.gregson at qut.edu.au
Wed Jul 17 04:29:29 BST 2013
Thanks Ian and Tim.
I didn't do any substantial hunting for leaks, but after familiarising myself with the code I realised a few hacks could reduce the memory footprint considerably. I ended up patching some of the request body handling code from Apache::CRUD in 3.3 back into the DepositHandler in 3.2. That made some modest gains, but the memory usage of the XML parsing was still massive (something like 5 times the size of the original file).
I didn't want to mess around with the internals too much by trying to backport the 3.3 SAX parsing, so instead I wrote a dirty little text parser that extracts the Base64 data into temporary files and replaces it with a URL element referencing the file. enable_file_imports must be set in the repository config for this to work. Currently this code is in the DepositHandler and runs when the request body is written to disk, but it would be better encapsulated in the input_file method of the import plugin.
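For anyone wanting to try the same trick, here is a rough sketch of the idea in Python rather than the Perl that actually lives in the DepositHandler. It is not the patch itself. The element names (`<data encoding="base64">` and `<url>`), the function name externalise_base64 and the 256-byte limit on the opening tag are all assumptions made for illustration. It copies the XML through in fixed-size chunks, decodes each Base64 payload straight into a temp file, and writes a file:// URL in its place.

```python
import base64
import os
import re
import tempfile

# Assumed element names; adjust to whatever your XML actually uses.
OPEN_TAG = re.compile(rb'<data\b[^>]*encoding="base64"[^>]*>')
CLOSE_TAG = b"</data>"
# Assumption: an opening tag is never longer than this many bytes.
MAX_OPEN_TAG = 256


def externalise_base64(src, dst, tmp_dir, chunk_size=64 * 1024):
    """Copy XML from src to dst (binary file objects), moving each Base64
    payload into its own temp file and leaving a <url> element behind.
    Returns the list of temp file paths."""
    tmp_dir = os.path.abspath(tmp_dir)
    paths = []
    buf = b""
    out = None      # open temp file while inside a Base64 element
    pending = b""   # Base64 characters not yet forming a full 4-char group
    eof = False
    while not eof:
        chunk = src.read(chunk_size)
        eof = not chunk
        buf += chunk
        while True:
            if out is None:
                m = OPEN_TAG.search(buf)
                if m is None:
                    # Hold back a tail in case a tag straddles two reads.
                    keep = 0 if eof else min(len(buf), MAX_OPEN_TAG)
                    dst.write(buf[:len(buf) - keep])
                    buf = buf[len(buf) - keep:]
                    break
                dst.write(buf[:m.start()])
                buf = buf[m.end():]
                fd, path = tempfile.mkstemp(dir=tmp_dir)
                out = os.fdopen(fd, "wb")
                paths.append(path)
            else:
                end = buf.find(CLOSE_TAG)
                if end < 0:
                    keep = 0 if eof else min(len(buf), len(CLOSE_TAG) - 1)
                    payload = buf[:len(buf) - keep]
                    buf = buf[len(buf) - keep:]
                else:
                    payload = buf[:end]
                    buf = buf[end + len(CLOSE_TAG):]
                # Drop line breaks, then decode only whole 4-char groups
                # unless this is the end of the element.
                pending += b"".join(payload.split())
                cut = len(pending) if end >= 0 else len(pending) - len(pending) % 4
                out.write(base64.b64decode(pending[:cut]))
                pending = pending[cut:]
                if end < 0:
                    break
                out.close()
                out = None
                dst.write(b"<url>file://" + paths[-1].encode() + b"</url>")
    if out is not None:
        out.close()
        raise ValueError("unterminated Base64 element")
    return paths
```

Usage would be along the lines of opening the spooled request body and a new output file in binary mode and calling externalise_base64(body, rewritten, "/some/tmp/dir"). At most one chunk plus a few held-back bytes is in memory at a time, whatever the size of the attached files. The sketch does not XML-escape the path it writes into the URL, so pick a temp directory with a plain name.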
With these changes, file content never enters memory, so memory demand is completely decoupled from the file size (except possibly somewhere lower in the stack?).
I think this will use less memory than 3.3, which I assume reads the entire chunk of Base64 data into memory during the SAX XML parsing before writing it to a temp file. The cost is the inelegance of text-parsing an XML file and having to set enable_file_imports. I think it will suffice until we upgrade!
From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Ian Stuart
Sent: Friday, 12 July 2013 8:09 PM
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Re: Memory usage in 3.2, Sword 1.3 and epdata packages
I also believe that the SWORD 2 implementation of the XML importer *does* replicate the SWORD 1.3 process: you can send it a complete file and it will do the CREATE and UPDATE processes in one go.
On 12/07/13 10:51, Tim Brody wrote:
> Correct. In 3.2 the HTTP post is all worked on in memory. In 3.3 XML
> data are streamed and written to disk as they arrive.
> On Fri, 2013-07-12 at 08:26 +0100, Ian Stuart wrote:
>> With no real knowledge, and certainly no investigation.... I would
>> suspect the problem is actually with how the base64 files are
>> handled, rather than being an EPrints memory leak per se.
>> From the SWORD importers I've written, the process seems to be to
>> 1) read in the deposit
>> 2) unpack the deposit (zip into disk space, XML into memory)
>> 3) create the eprint object
>> 4) attach the files
>> 5) write everything out
>> So I would suspect that what's happening is that all your base64
>> files are created (in memory) from the XML (which is also in memory)
Developer: ORI, RJ-Broker, and OpenDepot.org Bibliographics and Multimedia Service Delivery team, EDINA, The University of Edinburgh.