[EP-tech] Injecting gigabyte-scale files into EPrints archive - impossible?

Florian Heß hess at ub.uni-heidelberg.de
Fri Aug 1 10:25:53 BST 2014

Hello developers and users,

once again I'm sorry to have to consult you about a problem we've run 
into and couldn't solve ourselves.

We need to attach a big file, about 3 GB in size, to a document. We have 
limited web uploads to 100 MB in the webserver configuration so that we 
keep control of large file uploads. To get bigger files into the archive 
we successfully use the following command:

/usr/bin/perl ~eprints/bin/toolbox $repo addFile \
    --document $docid --filename $filename < /path/to/existing/file

(Besides, is there a convenient way of getting the document id? It is 
rather tedious: we upload a placeholder file so that we can seek out and 
grab the doc id manually with the Firebug extension; after running the 
command, we open the EPrint file dialog in the document metadata to 
switch the main file and delete the placeholder.)
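
Perhaps that lookup could be done from the command line instead of via 
the browser. Below is a minimal, untested sketch using the Perl API; I 
am assuming that EPrints->new->repository(), get_all_documents() and 
get_main() behave as I read them in the 3.3 sources, and the script name 
and library path are only placeholders:

    #!/usr/bin/perl
    # list_docids.pl - untested sketch: print the ids and main files of
    # all documents attached to an eprint
    # usage: perl list_docids.pl <repositoryid> <eprintid>
    use strict;
    use warnings;
    use lib "/opt/eprints3/perl_lib";   # adjust to the local install

    use EPrints;

    my( $repoid, $eprintid ) = @ARGV;

    my $repo = EPrints->new->repository( $repoid )
        or die "Unknown repository '$repoid'\n";

    my $eprint = $repo->dataset( "eprint" )->dataobj( $eprintid )
        or die "No eprint with id $eprintid\n";

    # one line per attached document: document id, main file name
    foreach my $doc ( $eprint->get_all_documents )
    {
        printf "%d\t%s\n", $doc->get_id, $doc->get_main || "";
    }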

I narrowed this method down to a line of code in 
EPrints::Toolbox::get_data() that I doubt scales to these dimensions 
(given the memory available on our hardware):

     join("", <STDIN>)

builds, in EPrints 3.3.10, a monstrous Perl scalar that is repeatedly 
expanded and moved around in memory as it grows. I wonder whether there 
is a way to move the file to the expected place myself and adjust the 
file record in the EPrints database. I have already tried this, but in 
the end I only downloaded the tiny placeholder file again. I deleted 
that file on the console (rm), but then EPrints threw "couldn't read 
file contents", so somewhere things were still arranged for the old 
file. The browser does, however, display the right filename in the modal 
dialog offering to save the file or open it with some program.
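
What I would really like is to hand EPrints an open filehandle so that 
the storage layer copies the data itself instead of me slurping it into 
one scalar. Something along the lines of the following untested sketch 
is what I have in mind; I am assuming that add_stored_file() accepts a 
filehandle and streams it in chunks rather than reading it all at once, 
and that set_main() plus commit() are enough to update the records (the 
script name and paths are again placeholders):

    #!/usr/bin/perl
    # add_big_file.pl - untested sketch: attach a large file to an
    # existing document by passing an open filehandle to the API
    # usage: perl add_big_file.pl <repositoryid> <docid> <path>
    use strict;
    use warnings;
    use lib "/opt/eprints3/perl_lib";   # adjust to the local install
    use File::Basename qw( basename );

    use EPrints;

    my( $repoid, $docid, $path ) = @ARGV;

    my $repo = EPrints->new->repository( $repoid )
        or die "Unknown repository '$repoid'\n";

    my $doc = $repo->dataset( "document" )->dataobj( $docid )
        or die "No document with id $docid\n";

    open( my $fh, "<", $path ) or die "Cannot open $path: $!\n";
    binmode $fh;

    my $filename = basename( $path );

    # assumption: the storage plugin reads from the filehandle in
    # chunks instead of building one huge Perl scalar
    $doc->add_stored_file( $filename, $fh, -s $path )
        or die "add_stored_file failed for $filename\n";
    close $fh;

    # make the new file the main one and write the changes back
    $doc->set_main( $filename );
    $doc->commit;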

The toolbox command ran for more than two hours, gorging on swap space 
like there was no tomorrow, before we killed it. It consumed 2% CPU on 
average, and its status flag was "D" most of the time (man ps: 
"uninterruptible sleep (usually IO)"). It appeared to me that it was 
constantly swapping.

Today I tried the toolbox addDocument command, which doesn't seem to 
save me any work after all; it just requires XML data. And with 
<url>file:///path/of/file/to/import</url> it runs out of disk space 
again while "downloading" that URL into /tmp.
I wish I could pass the path of a file to be copied directly; isn't that 
possible somehow?
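
If the filehandle approach sketched above works, the placeholder upload 
could probably be dropped altogether by also creating the document from 
the command line. Again only an untested sketch that reuses $repo, 
$eprintid, $path and basename() from above; I am assuming that 
create_subdataobj() on the eprint accepts a "documents" entry like this 
and that setting format and security is sufficient:

    # untested continuation of the sketch above: create a fresh
    # document on an eprint and attach the big file, no placeholder
    my $eprint = $repo->dataset( "eprint" )->dataobj( $eprintid )
        or die "No eprint with id $eprintid\n";

    # assumption: create_subdataobj() creates the document and links
    # it to the eprint in one go
    my $doc = $eprint->create_subdataobj( "documents", {
        format   => "application/zip",   # adjust to the real mime type
        security => "public",
    } ) or die "Could not create document\n";

    open( my $fh, "<", $path ) or die "Cannot open $path: $!\n";
    binmode $fh;
    $doc->add_stored_file( basename( $path ), $fh, -s $path );
    close $fh;

    $doc->set_main( basename( $path ) );
    $doc->commit;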

Kind regards

UB Heidelberg (Altstadt)
Plöck 107-109, 69117 HD
Abt. Informationstechnik
