
[EP-tech] Re: Injecting gigabyte-scale files into EPrints archive - impossible?



The only option seems to be to enlarge /tmp :-)
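An alternative worth trying before resizing the partition, as an untested sketch: the temp file in the error below is created via File::Temp, which honours the TMPDIR environment variable, so pointing TMPDIR at a partition with enough free space may avoid enlarging /tmp. The scratch path is a placeholder.

```shell
# Sketch (untested against a live repository): redirect temp files to a
# bigger partition before running the import. Path is a placeholder.
export TMPDIR=/srv/scratch/eprints-tmp
mkdir -p "$TMPDIR"
mktemp    # temp files now land under /srv/scratch/eprints-tmp
# Then, in the same shell:
# bin/import "$repo" --enable-import-ids --enable-file-imports document XML "$xmlfile"
```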

On 01/08/2014 15:31, Florian Heß wrote:
> On 01.08.2014 11:52, Yuri wrote:
>> There's no official documentation for the toolbox; it should be documented
>> better.
>>
>> Can't you just use import with these options:
>>
>>       --enable-import-ids
>>                By default import will generate a new eprintid, or userid for
>>                each record. This option tells it to use the id specified in the
>>                imported data. This is generally used for importing into a new
>>                repository from an old one.
>>
>>
>>        --enable-file-imports
>>                Allow the imported data to import files from the local
>>                filesystem. This can obviously be seen as a security hole
>> if you
>>                don't trust the data you are importing. This sets the
>>                "enable_file_imports" configuration option for this session
>>                only.
>>
>> after you've exported the eprints, modified the document section, and
>> reimported them?
>>
> Thanks, Yuri ...
>
> I've gone that way already I am afraid. If the system didn't try to
> upload, it wouldn't cry "not enough diskspace left on device".
>
> So that nothing remains untried, I run:
> bin/import $repo --enable-import-fields --enable-file-imports document
> XML $xmlfile
> Error! Unhandled exception in Import::XML: Can't write to
> '/tmp/E2FCKTjvNh': Auf dem Gerät ist kein Speicherplatz mehr verfügbar
> at /usr/share/perl5/LWP/Protocol.pm line 115. at
> /usr/lib/perl5/XML/LibXML/SAX.pm line 80 at
> .../eprints/bin/../perl_lib/EPrints/XML/LibXML.pm line 137
> (The German part of the message means "No space left on device".)
>
> I even dropped the "file://" prefix, hoping that would make the system
> perform a plain filesystem operation (as the docs above imply), but it
> still uses LWP.
>
> It said "Download (0b)", and when I cp'd the file to where it was
> expected, it still "failed to get file contents". I finally solved this
> by studying the sources and then manually inserting the values
> (FILEID,0,"Storage::Local") into the files_copies_pluginid database
> table and (FILEID,0,FILENAME) into files_copies_sourceid. It works like
> a charm now, but hacking the database should not be necessary; I
> promise I will use the API in the future. ;-)
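The manual fix described above, written out as SQL for clarity. This is a sketch based only on the table names reported in this thread; the value-column names (copies_pluginid, copies_sourceid) follow EPrints' usual table-per-subfield naming and are an assumption, and FILEID / FILENAME are placeholders for the real file id and filename.

```sql
-- Sketch of the manual database fix; value-column names are assumed from
-- EPrints' naming convention, FILEID / 'FILENAME' are placeholders.
INSERT INTO files_copies_pluginid (fileid, pos, copies_pluginid)
VALUES (FILEID, 0, 'Storage::Local');

INSERT INTO files_copies_sourceid (fileid, pos, copies_sourceid)
VALUES (FILEID, 0, 'FILENAME');
```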
>
>> Another option is to use a Perl library for efficient file handling and
>> change the code where it does
>>
>>     join("", <STDIN>)
> Still, get_data() is expected to return a string, and this probably
> wouldn't be the only place that needs changing.
>
> The function should return a reference to a scalar, something like
> \do { local $/; scalar <STDIN> }, which I did not test, however. This is
> known as the file-slurping idiom in Perl. But this code is still
> dangerous: attach a never-ending stream to standard input, even by
> mistake, and your system will have a hard time providing infinite
> memory.
>
>
> Kind regards
> Florian
>
>
>>
>>
>>
>> On 01/08/2014 11:25, Florian Heß wrote:
>>> Hello developers and users,
>>>
>>> again I'm sorry to have to consult you about a problem we've run into
>>> and couldn't solve ourselves.
>>>
>>> We need to attach a big file to a document, i.e. one of 3 GB in size. We
>>> limited web uploads to 100 MB in the webserver configuration in order to
>>> keep control of large file uploads. To get bigger files into the archive
>>> we successfully use the following command:
>>>
>>> /usr/bin/perl ~eprints/bin/toolbox $repo addFile \
>>>        --document $docid --filename $filename < /path/to/existing/file
>>>
>>> (Besides, is there a convenient way of getting the document id? It is
>>> rather tedious to upload a placeholder file so we can manually seek out
>>> and grab a doc id with the Firebug extension; after running the command,
>>> we open the EPrint file dialog in the document metadata to switch the
>>> main file and delete the placeholder.)
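One possible shortcut for the placeholder-upload dance, assuming the stock EPrints 3 database schema: look the document id up directly. The document table and its eprintid/docid/main columns are an assumption here, and 123 is a placeholder eprint id.

```sql
-- Sketch: list the documents attached to eprint 123; schema assumed.
SELECT docid, pos, main FROM document WHERE eprintid = 123;
```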
>>>
>>> I narrowed this method down to a line of code in
>>> EPrints::Toolbox::get_data() whose scalability I question at these
>>> dimensions (given our machine's memory):
>>>
>>>         join("", <STDIN>)
>>>
>>> builds, in EPrints 3.3.10, a monstrous Perl scalar that is repeatedly
>>> expanded and moved around in memory to make everything fit. I wonder if
>>> there is a way I can move the file to the expected place myself and
>>> adjust the file record in the EPrints database. I tried this already,
>>> but in the end I just got the tiny placeholder file downloaded again. I
>>> deleted the file on the console (rm), but then EPrints threw "couldn't
>>> read file contents", so somewhere things were still arranged for the
>>> old file. The browser does display the right filename in the modal
>>> dialog offering to save the file or open it with some program.
>>>
>>> The toolbox command had been running for more than two hours, gorging
>>> swap space like there was no tomorrow, before we killed it. It consumed
>>> 2% of CPU on average, and its status flag was "D" most of the time (man
>>> ps: "uninterruptible sleep (usually IO)"). It appeared to be constantly
>>> swapping.
>>>
>>> Today I tried the toolbox addDocument command, which doesn't seem to
>>> save me any work after all; it just takes XML data. But with
>>> <url>file:///path/of/file/to/import</url>, it runs out of disk space
>>> again while "downloading" that URL to /tmp.
>>> I wish I could pass the path of a file to be copied directly; isn't
>>> that possible somehow?
>>>
>>>
>>> Kind regards
>>> Florian
>>>
>>>
>>
>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>> *** Archive: http://www.eprints.org/tech.php/
>> *** EPrints community wiki: http://wiki.eprints.org/
>> *** EPrints developers Forum: http://forum.eprints.org/
>>
>