[EP-tech] Re: Injecting gigabyte-scale files into EPrints archive - impossible?

Florian Heß hess at ub.uni-heidelberg.de
Fri Aug 1 14:31:44 BST 2014

On 01.08.2014 11:52, Yuri wrote:
> There's no official documentation about the toolbox; it should be
> documented better.
> Can't you just use import with these options:
>      --enable-import-ids
>               By default import will generate a new eprintid or userid for
>               each record. This option tells it to use the id specified in
>               the imported data. This is generally used for importing into
>               a new repository from an old one.
>       --enable-file-imports
>               Allow the imported data to import files from the local
>               filesystem. This can obviously be seen as a security hole if
>               you don't trust the data you are importing. This sets the
>               "enable_file_imports" configuration option for this session
>               only.
> after you've exported the eprints, modified the document section and
> reimported it?

Thanks, Yuri ...

I'm afraid I've already gone down that road. If the system didn't try to 
upload the file, it wouldn't complain "no space left on device".

So that nothing remains untried, I ran:
bin/import $repo --enable-import-fields --enable-file-imports document 
XML $xmlfile
Error! Unhandled exception in Import::XML: Can't write to 
'/tmp/E2FCKTjvNh': No space left on device 
at /usr/share/perl5/LWP/Protocol.pm line 115. at 
/usr/lib/perl5/XML/LibXML/SAX.pm line 80 at 
.../eprints/bin/../perl_lib/EPrints/XML/LibXML.pm line 137
(The message appears in German here: "Auf dem Gerät ist kein 
Speicherplatz mehr verfügbar", i.e. "no space left on device".)

I even dropped the "file://" prefix, hoping that would make the system 
run a plain filesystem operation (as the docs above imply), but it still 
uses LWP.
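(A workaround that might be worth trying: the temp file name in the error above looks like it comes from File::Temp, which creates files under File::Spec->tmpdir() and therefore honours the TMPDIR environment variable. Whether LWP's spool file really is created that way is an assumption I have not verified against EPrints; if it holds, pointing TMPDIR at a partition with enough room before running bin/import should at least move the 3 GB spool file off /tmp. A minimal sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(cwd);
use File::Temp qw(tempfile);

# Stand-in directory for a partition with enough free space for the
# 3 GB file (assumption: in real use this would be e.g. /bigdisk/tmp).
my $roomy = cwd() . "/roomy_tmp_$$";
mkdir $roomy or die "mkdir $roomy: $!";

# File::Temp asks File::Spec->tmpdir(), which reads TMPDIR, so temp
# files created from now on land in the roomy directory.
$ENV{TMPDIR} = $roomy;
my ( $fh, $name ) = tempfile( UNLINK => 1 );
print "spool file created under: $name\n";
```

If the assumption holds, running the import as `TMPDIR=/bigdisk/tmp bin/import ...` from the shell would have the same effect.)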

It said "Download (0b)", and when I cp'd the file to where it was 
expected, it still "failed to get file contents". I finally solved this 
by studying the sources and then manually inserting the values 
(FILEID, 0, "Storage::Local") into the files_copies_pluginid table and 
(FILEID, 0, FILENAME) into the files_copies_sourceid table. It now works 
like a charm, but hacking the database should not be necessary; I 
promise I will use the API in the future. ;-)
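(For the record, the manual fix amounted to two rows. Here it is sketched with DBI against an in-memory SQLite database as a stand-in: the column names, the file id 123 and the filename are made-up placeholders, and the real files_copies_* tables live in the repository's MySQL schema and should normally only be touched through the EPrints API:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# In-memory SQLite stand-in for the repository database. The real
# tables are created by EPrints; column names here are assumptions.
my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "",
    { RaiseError => 1, AutoCommit => 1 } );
$dbh->do("CREATE TABLE files_copies_pluginid (fileid INTEGER, pos INTEGER, pluginid TEXT)");
$dbh->do("CREATE TABLE files_copies_sourceid (fileid INTEGER, pos INTEGER, sourceid TEXT)");

my ( $fileid, $filename ) = ( 123, "bigfile.bin" );    # placeholders

# Tell EPrints the file has one copy, held by the local storage plugin
# under the given source name.
$dbh->do( "INSERT INTO files_copies_pluginid VALUES (?, ?, ?)",
    undef, $fileid, 0, "Storage::Local" );
$dbh->do( "INSERT INTO files_copies_sourceid VALUES (?, ?, ?)",
    undef, $fileid, 0, $filename );
```

)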

> Another option is to use a Perl library for efficient file handling and
> change the code where it does
>    join("", <STDIN>)

Still, get_data() is expected to return a string, and this probably 
wouldn't be the only place that needs changing.

The function should rather return a reference to a scalar, something 
like \do { local $/; scalar <STDIN> }, which I have not tested, however. 
This is known as the file-slurping idiom in Perl. But such code is still 
dangerous: attach a never-ending story to standard input, even just by 
mistake, and your system will have a hard time providing infinite memory.
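A memory-friendly alternative would be to copy the stream in fixed-size chunks instead of slurping, so memory use stays bounded no matter how large the file is. The function name and chunk size below are my own inventions, a sketch only: EPrints' get_data() and its callers would of course have to be changed to hand around filehandles instead of one giant scalar.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy everything from $in to $out in chunks of $chunk_size bytes, so
# at most one chunk is held in memory at any time.
sub copy_in_chunks {
    my ( $in, $out, $chunk_size ) = @_;
    $chunk_size ||= 4 * 1024 * 1024;    # default: 4 MiB per read
    binmode $in;
    binmode $out;
    my $buf;
    while (1) {
        my $n = read( $in, $buf, $chunk_size );
        die "read failed: $!" unless defined $n;
        last if $n == 0;                # EOF reached
        print {$out} $buf or die "write failed: $!";
    }
    return;
}
```

With a 3 GB input this keeps the process at a few megabytes of buffer instead of a perpetually reallocated multi-gigabyte scalar.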

Kind regards

> On 01/08/2014 11:25, Florian Heß wrote:
>> Hello developers and users,
>> again I'm sorry that I have to consult you about a problem we've run
>> into and couldn't solve ourselves.
>> We need to attach a big file to a document, i.e. one of 3 GB in size. We
>> limited web uploads to 100 MB by webserver configuration in order to
>> keep control of large file uploads. To get bigger files into the archive
>> we successfully use the following command:
>> /usr/bin/perl ~eprints/bin/toolbox $repo addFile \
>>       --document $docid --filename $filename < /path/to/existing/file
>> (Besides, is there a convenient way of getting the document id? It is
>> rather tedious to upload a placeholder file so we can manually seek and
>> grab a doc id with the Firebug extension; after running the command, we open
>> the EPrint file dialog in the document metadata to switch the main file
>> and delete the placeholder.)
>> I narrowed this method down to a line of code in
>> EPrints::Toolbox::get_data() whose scalability I question for these
>> dimensions (given our hardware's memory):
>>        join("", <STDIN>)
>> builds, in EPrints 3.3.10, a monstrous Perl scalar that is certainly
>> being perpetually expanded and moved around in memory as it grows. I
>> wonder if there is a way I can move the file to the expected place myself
>> and
>> adjust the file record in the EPrints database. I tried this already, but
>> in the end I found myself downloading the tiny placeholder file again. I
>> deleted the file in the console (rm), but then the EPrints system threw
>> "couldn't read file contents". So, somewhere, things were still arranged
>> for the old file. The browser, though, displays the right filename in the
>> modal dialog offering to save the file or open it with some program.
>> The toolbox command had appallingly been running for more than two hours,
>> gorging swap space like there was no tomorrow, when we killed it. It had
>> consumed 2% of CPU on average, and its status flag was "D" most of the
>> time (man ps: "uninterruptible sleep (usually IO)"). It appeared to me
>> that it was constantly swapping.
>> Today I tried the toolbox addDocument command, which doesn't seem to save
>> me any work after all; it just requires XML data. But with
>> <url>file:///path/of/file/to/import</url>, it runs out of disk space
>> again while "downloading" that URL into /tmp.
>> I wish I could pass the path of a file to be copied directly; isn't that
>> possible somehow?
>> Kind regards
>> Florian
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/

UB Heidelberg (Altstadt)
Plöck 107-109, 69117 HD
Abt. Informationstechnik
