EPrints Technical Mailing List Archive

Message: #06567



Re: [EP-tech] Dissecting the Documents folder


Hi Andrew,

> Do I ... put it in the new <eprints_root>/archives/<myarchive>/documents folder?
I don't know what else would have to be done for that, so instead I'll describe the approach that has worked for me in the past:

- Unpack your documents to e.g. /tmp/disc0/00/... (none of the thumbnail or indexcodes files is crucial)

- Replace the leading part of each <url> so that it points to the physical directory structure. I did this with the following substitution commands (written in vi/ex syntax; for sed, drop the leading %):
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)\/\([0-9][0-9]\)/\/tmp\/disc0\/00\/0\1\/\2\/\3\/\4/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)\/\([0-9]\)/\/tmp\/disc0\/00\/0\1\/\2\/\3\/0\4/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9][0-9]\)\([0-9][0-9]\)\/\([0-9][0-9]\)/\/tmp\/disc0\/00\/00\/\1\/\2\/\3/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9][0-9]\)\([0-9][0-9]\)\/\([0-9]\)/\/tmp\/disc0\/00\/00\/\1\/\2\/0\3/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9]\)\([0-9][0-9]\)\/\([0-9][0-9]\)/\/tmp\/disc0\/00\/00\/0\1\/\2\/\3/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9]\)\([0-9][0-9]\)\/\([0-9]\)/\/tmp\/disc0\/00\/00\/0\1\/\2\/0\3/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9][0-9]\)\/\([0-9][0-9]\)/\/tmp\/disc0\/00\/00\/00\/\1\/\2/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9][0-9]\)\/\([0-9]\)/\/tmp\/disc0\/00\/00\/00\/\1\/0\2/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9]\)\/\([0-9][0-9]\)/\/tmp\/disc0\/00\/00\/00\/0\1\/\2/
%s/http:\/\/eprints.lincoln.ac.uk\/\([0-9]\)\/\([0-9]\)/\/tmp\/disc0\/00\/00\/00\/0\1\/0\2/
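The ten rules above all seem to do the same thing: zero-pad the eprint id to 8 digits, split it into pairs, and zero-pad the document position to 2 digits. As a rough sketch of that reading only (the names `url_to_path`, `URL_RE` and `ROOT` are made up for illustration, and unlike the rules above it rewrites every match on a line, not just the first), the same mapping could be written in Python:

```python
import re

# Assumption read off the substitution rules: ids of 1-5 digits,
# document positions of 1-2 digits, files unpacked below /tmp/disc0.
ROOT = "/tmp/disc0"
URL_RE = re.compile(r"http://eprints\.lincoln\.ac\.uk/([0-9]{1,5})/([0-9]{1,2})")


def url_to_path(text: str) -> str:
    """Replace the URL prefix by the padded on-disk directory prefix."""

    def repl(match: re.Match) -> str:
        eprint_id = f"{int(match.group(1)):08d}"   # e.g. 12345 -> 00012345
        position = f"{int(match.group(2)):02d}"    # e.g. 1 -> 01
        pairs = [eprint_id[i:i + 2] for i in range(0, 8, 2)]
        return "/".join([ROOT, *pairs, position])

    # Anything after the position (such as the file name) is left untouched.
    return URL_RE.sub(repl, text)
```

For example, `url_to_path("http://eprints.lincoln.ac.uk/12345/1/paper.pdf")` gives `/tmp/disc0/00/01/23/45/01/paper.pdf`, which is what the second rule produces for the same input.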

- Take care of spaces in the file paths. Fortunately the file names on our Linux system contained no spaces, so I have NO experience with that :-)

- Remove all <rev_number> tags with `xmlstarlet ed -d "//_:rev_number" in.xml > /tmp/out.xml` to restart the change history

- Check your import file with `~/Eprints/bin/import yourRepo --parse-only --force archive XML yourInput`

- Start the final run with `~/Eprints/bin/import yourRepo --migration --force archive XML yourInput`

- If anything fails, erase the imported eprints with `~/Eprints/bin/epadmin erase_eprints yourRepo` and start again

> Which part of the xml needs rewriting to tell the import 
> where to look for the file?
None: the modified <url> already tells the import where to find each file.

The new numbering follows the order of the entries in your import file, so any gaps in the old IDs will disappear. This can cause some confusion when you compare the old and the new repository ...
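To make that comparison easier, one could build a table from old to new ids before importing. This is only a sketch under assumptions: that the import file holds <eprint> elements directly under the root, each with an <eprintid> child, and that the new ids start at 1 and increase by one per entry (the start value depends on the target repository). The name `id_mapping` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix so the code works with any namespace."""
    return tag.rsplit("}", 1)[-1]


def id_mapping(xml_text: str, first_new_id: int = 1) -> dict:
    """Map each old eprintid to the id it would get in file order."""
    root = ET.fromstring(xml_text)
    mapping = {}
    next_id = first_new_id  # assumption: ids are handed out sequentially
    for eprint in root:
        if local_name(eprint.tag) != "eprint":
            continue
        for child in eprint:
            if local_name(child.tag) == "eprintid":
                mapping[int(child.text)] = next_id
                break
        next_id += 1
    return mapping
```

Calling `id_mapping(open("/tmp/out.xml").read())` on the cleaned import file would then give a dictionary you can consult when an old id no longer matches.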

Hth
Thomas