[EP-tech] Re: Memory usage in 3.2, Sword 1.3 and epdata packages
mark.gregson at qut.edu.au
Fri Jul 19 01:30:24 BST 2013
Thanks Tim, I'm glad to hear that we won't have to revisit the issue in 3.3.
I eventually threw out the text parser in favour of a custom SAX filter, which, in retrospect, seems like the obvious way to implement this hack ...
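For readers following along, here is a rough sketch of what such a SAX filter can look like. This is not the actual EPrints (Perl) code; it is an illustrative Python version, and the `data` element name and `encoding` attribute are assumptions standing in for whatever the real epdata schema uses. The point is that character data is decoded and streamed to a temp file as it arrives, so the Base64 payload never accumulates in memory.

```python
# Illustrative sketch only (not EPrints code): a SAX handler that streams
# base64 payloads to temp files instead of buffering them in memory.
# Element name "data" and attribute "encoding" are hypothetical.
import base64
import tempfile
import xml.sax

class Base64Extractor(xml.sax.ContentHandler):
    def __init__(self):
        self.in_data = False
        self.buf = ""        # carries partial base64 quanta between events
        self.tmp = None
        self.files = []      # paths of the extracted payloads

    def startElement(self, name, attrs):
        if name == "data" and attrs.get("encoding") == "base64":
            self.in_data = True
            self.tmp = tempfile.NamedTemporaryFile(delete=False)

    def characters(self, content):
        if not self.in_data:
            return
        # SAX may split text into many events; strip whitespace and decode
        # only complete 4-character base64 quanta, keeping the remainder.
        self.buf += "".join(content.split())
        usable = len(self.buf) - len(self.buf) % 4
        if usable:
            self.tmp.write(base64.b64decode(self.buf[:usable]))
            self.buf = self.buf[usable:]

    def endElement(self, name):
        if name == "data" and self.in_data:
            if self.buf:
                self.tmp.write(base64.b64decode(self.buf))
                self.buf = ""
            self.tmp.close()
            self.files.append(self.tmp.name)
            self.in_data = False

handler = Base64Extractor()
xml.sax.parseString(
    b'<eprint><data encoding="base64">aGVsbG8gd29ybGQ=</data></eprint>',
    handler,
)
```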
From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Tim Brody
Sent: Wednesday, 17 July 2013 6:07 PM
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Re: Memory usage in 3.2, Sword 1.3 and epdata packages
The CRUD code streams the input into a temporary file, so there should never be any memory pressure. A temporary file is used because the POST/PUT body is first checked against its checksum.
There's still scope for improvement by removing the temporary-file stage, but that would involve watching the input stream and backing out of a partial construction if an error occurs.
Otherwise, future EPrints versions will support ranged PUTs, which is how the AJAX upload works. That allows arbitrarily sized objects plus the ability to resume.
On Wed, 2013-07-17 at 13:29 +1000, Mark Gregson wrote:
> Thanks Ian and Tim.
> I didn't do any substantial hunting for leaks, but after familiarising myself with the code I realised a few hacks could reduce the memory footprint considerably. I ended up patching some of the request-body handling code from Apache::CRUD in 3.3 back into the DepositHandler in 3.2, which made some modest gains, but the memory usage of the XML parsing was still massive (something like 5 times the size of the original file).
> I didn't want to mess around with internals too much by trying to back-port the 3.3 SAX parsing, so instead I wrote a dirty little text parser that extracts the Base64 data into temporary files and replaces it with a URL element referencing the file. enable_file_imports must be set in the repository config for this to work. Currently this code is in the DepositHandler and runs when the request body is written to disk, but it would be better encapsulated in the input_file method of the import plugin.
> With these changes files never enter memory and so memory demand is completely decoupled from the file size (except possibly somewhere lower in the stack?).
> I think this memory usage will be less than 3.3's, which I assume reads each entire chunk of Base64 data into memory during the SAX XML parsing before writing it to a temp file, at the expense of the inelegance of text-parsing an XML file and of having to enable file imports. I think it will suffice until we upgrade!
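The text-parsing trick quoted above can be sketched like this. Again, this is not the actual DepositHandler patch: the `<data encoding="base64">` and `<url>` element names are hypothetical stand-ins for the real epdata schema, and a regex substitution pulls each payload out to a temp file before the XML parser ever sees it.

```python
# Rough illustration (not the actual DepositHandler patch) of the
# text-parsing approach: extract base64 payloads with a regex and
# substitute a URL element pointing at a temp file. Element names
# are hypothetical stand-ins for the real epdata schema.
import base64
import re
import tempfile

DATA_RE = re.compile(r'<data encoding="base64">(.*?)</data>', re.S)

def extract_base64(xml_text):
    def to_url(match):
        tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".bin")
        tmp.write(base64.b64decode("".join(match.group(1).split())))
        tmp.close()
        # The importer must be allowed to fetch file URLs, i.e.
        # enable_file_imports must be set in the repository config.
        return "<url>file://%s</url>" % tmp.name
    return DATA_RE.sub(to_url, xml_text)
```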
> -----Original Message-----
> From: eprints-tech-bounces at ecs.soton.ac.uk
> [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Ian Stuart
> Sent: Friday, 12 July 2013 8:09 PM
> To: eprints-tech at ecs.soton.ac.uk
> Subject: [EP-tech] Re: Memory usage in 3.2, Sword 1.3 and epdata
> I also believe that the SWORD 2 implementation of the XML importer
> *does* replicate the SWORD 1.3 process: you can send it a complete file and it will do the CREATE and UPDATE processes in one go.
> On 12/07/13 10:51, Tim Brody wrote:
> > Correct. In 3.2 the HTTP POST is worked on entirely in memory. In 3.3, XML
> > data are streamed and written to disk as they arrive.
> > /Tim.
> > On Fri, 2013-07-12 at 08:26 +0100, Ian Stuart wrote:
> >> With no real knowledge, and certainly no investigation.... I would
> >> suspect the problem is actually with how the base64 files are
> >> handled, rather than an EPrints memory leak per se.
> >> From the SWORD importers I've written, the process seems to be to
> >> 1) read in the deposit
> >> 2) unpack the deposit (zip into disk space, XML into memory)
> >> 3) create the eprint object
> >> 4) attach the files
> >> 5) write everything out
> >> So I would suspect that what's happening is that all your base64
> >> files are created (in memory) from the XML (which is also in
> >> memory)