EPrints Technical Mailing List Archive

Message: #06196


< Previous (by date) | Next (by date) > | < Previous (in thread) | Next (in thread) > | Messages - Most Recent First | Threads - Most Recent First

Re: [EP-tech] Migrating output data - most efficient way


Hi Dan,

 

Thanks for this remarkably thorough write up! One thing I wasn’t aware of was the export function; I was grabbing the XML from the front end but this should obviously fill in any gaps. I am, however, having problems with it!

 

Using the syntax:

export <repositoryid> archive XML > <filename>.xml

gives me a file, however the file is never populated. I don’t know if this is an issue in our repository or if there is a patch anywhere? No errors are thrown on the command line so I’m at a loss! Stumbled at the first hurdle!

 

Andrew

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Stieneke, Dan
Sent: 20 January 2017 20:32
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Migrating output data - most efficient way

 

I’ve done migrations of this sort several times. I never did a database dump, but instead exported/imported XML, on the premise when doing a database copy, mysql would only tell me about database errors, but an export/import would give eprints a chance to tell me about eprints errors.

 

 

On the most recent iteration, moving from 3.2.4 to 3.3.15, I ran into two problems migrating the actual eprints:

 

1)      Exporting was fine, but the import threw a ton of errors. Reversing the .XML file (so newest items were imported first, oldest last) worked. To import in reverse order:

a.        echo "select eprintid from eprint;" | mysql -u root -p ARCHIVEID > list.txt - get a list of current eprints

b.       edit list.txt to

                                                              i.      Remove top line, which is “eprints”

                                                            ii.      Remove any cancelled items. You have to make at least one import attempt to figure out what these are.
When you do a full
export, eprints doesn't emit cancelled or retired items. When you explicitly specify each record to export like this, it does. When you import using --enable-web-imports, and the new box tries to pull in cancelled or retired items, the old box refuses to give up the .PDF files, so you get errors.

c.       tac < list.txt > revlist.txt - reverse the list (tac is cat in reverse)

d.      tr '\n' ' ' < revlist.txt > revlistspaces.txt (go from one eprintID per line to a list of space-separated eprintIDs)

e.       ~/bin/export ARCHIVEID archive XML $(cat revlistspaces.txt) > REV.xml

                                                        i.      This depends on pulling the .PDF files in from the webserver. Use XMLFiles instead of XML to get the whole thing Base64-encoded in the .XML file

 

2)      Thesis info changed somewhere between those two versions

a.       Old phd info: <thesis_type>phd</thesis_type>

b.      Equivalent new phd info: <thesis_type>postdoctoral</thesis_type><thesis_name>phd</thesis_name>

c.       Fixit file : phdfix.sh

 #!/bin/bash

sed 's!<thesis_type>phd</thesis_type>!<thesis_type>postdoctoral</thesis_type><thesis_name>phd</thesis_name>!' < $1 > $2

d.      ./phdfix.sh REV.xml REV_phdfix.xml (names based on the reversal done above)

e.       ~/bin/epadmin erase_data ARCHIVEID Had some difficulty with erase_eprints, apparently some 'leftovers'. This blows away & rebuilds the entire database.

f.        ~/bin/import_subjects ARCHIVEID

g.      ~/bin/import ARCHIVEID --enable-import-fields --enable-web-imports --update --force --verbose archive XML REV_phdfix.xml

                                                              i.      --update overwrites older eprints w/the same ID instead of appending them to the end of the collection.

 

Once done, verify:

  1. NEWBox : ~/bin/export ARCHIVEID archive XML > NEW.xml
  2. OLDBox : ~/bin/export ARCHIVEID archive XML > OLD.xml
  3. massage the files to get rid of known / OK differences
    • FILE sedsOld.sh

#!/bin/bash

# clean up encoding changes

dos2unix $1

# name change

sed -i 's!OLDURL!NEWURL!g' $1

# double to single quote

sed -i "s/\"/'/g" $1

# this appears in different spots in old &amp; new formats

sed -i "s! xmlns='http://eprints.org/ep2/data/2.0'!!" $1

# new format uses FQDN

sed -i 's!<uri>/id/document/!<uri>http://FQDN/id/document/!' $1

# these ALL change - import increments them

sed -i '/<rev_number>.*<\/rev_number>/d' $1

# shows up in different spots than new format

sed -i '/<mime_type>application\/pdf<\/mime_type>/d' $1

# quite a few of these empties show in the old format, not the new

sed -i '/<id><\/id>/d' $1

sed -i '/<reported_by><\/reported_by>/d' $1

sed -i '/<resolved_by><\/resolved_by>/d' $1

sed -i '/<comment><\/comment>/d' $1

# this is an entire multi-line section that lives in the old, but not in the new

perl -i -0pe 's/<copies>.*?<\/copies>/DELETEME/sg' $1 # can't easily delete; this replaces with literal DELETEME

# Encoding of abstract has changed. Not always visibly obvious how (CRLFs, I think)

#perl -i -0pe 's/<abstract>.*?<\/abstract>/DELETEME/sg' $1 # can't easily delete; this replaces with literal DELETEME

sed -i '/DELETEME/d' $1

# thesis type changes between the two versions

sed -i '/<thesis_type>phd<\/thesis_type>/d' $1

# old version encodes dashes; new version uses dashes

sed -i 's/%2D/-/g' $1

# old version encodes this; new version ignores???

sed -i 's/&#13;//g' $1

b.       FILE sedsNew.sh

#!/bin/bash

 

# clean up encoding changes

dos2unix $1

# these ALL change - import increments them

sed -i '/<rev_number>.*<\/rev_number>/d' $1

# shows up in different spots than new format

sed -i '/<mime_type>application\/pdf<\/mime_type>/d' $1

# this appears in different spots in old &amp; new formats

sed -i "s! xmlns='http://eprints.org/ep2/data/2.0'!!" $1

# these are now encoded; old format was not

sed -i "s/&quot;/\'/g" $1

# these are now encoded; old format was not

sed -i "s/&apos;/\'/g" $1

# thesis type changes between the two versions

sed -i '/<thesis_type>phd<\/thesis_type>/d' $1

sed -i '/<thesis_type>postdoctoral<\/thesis_type>/d' $1

sed -i '/<thesis_name>phd<\/thesis_name>/d' $1

# Encoding of abstract has changed. Not always visibly obvious how (CRLFs, I think)

perl -i -0pe 's/<abstract>.*?<\/abstract>/DELETEME/sg' $1 # can't easily delete; this replaces with DELETEME

sed -i '/DELETEME/d' $1

  1. diff the files and see what's left
  2. This shows that lastmod is the date of import on the NEW file. To set it back to what it was:

A.  OLDbox: echo 'SELECT CONCAT("update eprint set lastmod_year=",lastmod_year,", lastmod_month=",lastmod_month,", lastmod_day=",lastmod_day,", lastmod_hour=",lastmod_hour,", lastmod_minute=",lastmod_minute,", lastmod_second=",lastmod_second," WHERE eprintid=",eprintid,";") FROM eprint;' | mysql -u root -p ARCHIVEID > lastmod.sql

    • lastmod.sql, get rid of first line (shows the command that created this file), and prepend the output with use ARCHIVEID;
    • NEWbox: mysql -u root -p < lastmod.sql
  1. REPEAT: Export new / run the seds on new / compare

 

How to migrate the users:

*** BEGIN FILE: MigrateUsers.sh ***

#/bin/bash

 

OLDBOX=IP_OR_FQDN

OLDMYSQLPWD=PASSWORDGOESHERE

NEWMYSQLPWD=PASSWORDGOESHERE

 

ssh root@${OLDBOX} "sudo -u eprints /usr/share/eprints3/bin/export ARCHIVEID user XML" > oldboxusers.xml

# HEREDOC to avoid quote escapement torture

# grep to remove the motd lines that get put in there

ssh root@${OLDBOX} <<HEREDOC | grep "^update user set password.*" > oldboxpasswords.sql

mysql -u root -p${OLDMYSQLPWD} ARCHIVEID --skip-column-names -e 'select concat("update user set password=\"", password, "\" where username=\"", username, "\";") from user;'

HEREDOC

sudo -u eprints /usr/share/eprints3/bin/import ARCHIVEID user XML oldboxusers.xml

mysql -u root -p${NEWMYSQLPWD} ARCHIVEID < oldboxpasswords.sql

rm oldboxpasswords.sql

rm oldboxusers.xml

*** END FILE: MigrateUsers.sh ***

 

Sorry for some of the weird outline-autonumbering  / indentation – all of this was copied / pasted from my documentation wiki.

 

 

Good luck,

Dan Stieneke

IT Specialist

USDA - ARS - NWISRL

3793 N 3600 E

Kimberly, ID 83341

208/423-6519

 

 

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Andrew Beeken
Sent: Friday, January 20, 2017 9:41 AM
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Migrating output data - most efficient way

 

Hello all!

 

So, my quest to get the Lincoln repository onto a more standard EPrints install continues! Thanks for help with Ubuntu versions etc; I’ve now got that knowledge together and can start to look at the migration. I know I’ve asked this in the past, but my scope has changed slightly. Now, instead of looking at recreating our live repository like for like, I’m just looking at migrating the output data into a fresh EPrints installation and working from there.

 

I know I realistically have two options – importing from an EPrints XML dump off the main site or through a database dump. I’m assuming that in both cases I’ll need to bring over essential field definition, phrase and workflow files to avoid causing issues, and I’m assuming that, ideally, the database route would be best to make sure that any “in review” or hidden items are also brought across. Main qustions:

 

·         Am I right in this assumption?

·         Can I JUST bring out a subset of tables for the deposit data or do I have to do a full database export? I’d ideally prefer to not do the latter as there are customisations to the user table in our live EPrints that I’m trying to avoid replicating in this new environment.

·         Are there any gotchas I should be aware of?

 

Thanks, as always, in advance!

Andrew


The University of Lincoln, located in the heart of the city of Lincoln, has established an international reputation based on high student satisfaction, excellent graduate employment and world-class research.


The information in this e-mail and any attachments may be confidential. If you have received this email in error please notify the sender immediately and remove it from your system. Do not disclose the contents to another person or take copies.

Email is not secure and may contain viruses. The University of Lincoln makes every effort to ensure email is sent without viruses, but cannot guarantee this and recommends recipients take appropriate precautions.

The University may monitor email traffic data and content in accordance with its policies and English law. Further information can be found at: http://www.lincoln.ac.uk/legal.





This electronic message contains information generated by the USDA solely for the intended recipients. Any unauthorized interception of this message or the use or disclosure of the information it contains may violate the law and subject the violator to civil or criminal penalties. If you believe you have received this message in error, please notify the sender and delete the email immediately.