
[EP-tech] Re: Thesis Bulk Upload/Import



Hi James,

we recently imported about 3000 metadata records and PDFs from the Swiss
National Licence program into our repository and attached about 2000
further PDFs to existing metadata records.
I am currently working on importing about 4000 e-theses (metadata + PDF),
to be followed later by 60'000 metadata records of print theses of the
University of Zurich (going back into the 19th century) from UZH's library
system Aleph. This will increase the current size of our repo by 50%.

1) The biggest pro of having all documents in one repo is findability - you
don't want users to have to search several times in different repos.
The con is that if one does not have the full text for all records (as
with the metadata-only records above), the overall full-text and OA ratio
may be diluted.

2) This was answered by David Newman. Be aware that the code by Neugebauer
and Han for ingesting documents is not up to date and did not work on an
EPrints 3.3 repository - I had to learn that the hard way. If you need code
samples, let me know.

3) There may be no such thing as a preferred or ideal format. You have to
work with what you get from the data provider. In our case, this meant
writing our own import scripts and plug-ins. Also, there may be data
quality issues, which means one has to do a thorough data analysis before
the import and massive data massaging during it (if you have XML data,
XSLT 2.0 is your friend because of its strong grouping and sorting
facilities). And one has to be prepared to implement error handling for all
kinds of errors that can be caused by wrong, incomplete or missing data.
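
As an illustration of such a pre-import data analysis, here is a minimal
sketch in Python (not the XSLT 2.0 that was actually used). It groups the
values of each field, counts them and lists them alphabetically, so that
inconsistent spellings end up next to each other. The field names and the
sample records are made up for this example.

```python
from collections import Counter, defaultdict


def value_frequencies(records):
    """Group the values of every field and count how often each one occurs."""
    freq = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            # Collapse whitespace only, so that real inconsistencies stay visible.
            freq[field][" ".join(value.split())] += 1
    return freq


def print_report(freq):
    """Print the distinct values of each field in alphabetical order."""
    for field in sorted(freq):
        print(field)
        for value in sorted(freq[field]):
            print(f"  {freq[field][value]:5d}  {value!r}")


# Hypothetical sample records; real input would come from the provider's file.
records = [
    {"language": "ger", "year": "1987"},
    {"language": "ger ", "year": "1987"},
    {"language": "Deutsch", "year": "[1987]"},
]
freq = value_frequencies(records)
print_report(freq)
```

Reading such a listing field by field is usually enough to decide which
values need a mapping rule and which records need manual review.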

In the case of the National Licenses, this involved:
- getting CSV files from the data provider
- 1 script and 2 import plug-ins (NationalLicense, DOI)
- filtering out wrong records, because the provider's affiliation matching
was insufficient and there were also records from ETH Zurich (instead of
University of Zurich)
- extracting the DOIs, then doing a duplicate match or an import via the
DOI plug-in, to which a separate handler had to be passed
- guessing the Dewey classification from the ISSN of the journal in which
the article was published, using our journal database
- fetching the abstracts from a separate URL - the abstracts were not
stored in the CSV and sometimes are not available via Crossref
- adding missing fields that are not available in the metadata (e.g.
publication status, subject, OA status, copyright, and so on)
- downloading the PDFs and attaching them to the eprint, setting language,
format, content, embargo and security, and making thumbnails on the fly
- printing a report of the import (successes and failures, detected
duplicates)
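
The filtering, DOI extraction, duplicate matching and reporting steps can
be sketched roughly as follows. This is plain Python and not the EPrints
plug-in code; the CSV column names (title, affiliation, doi), the sample
rows and the set of already known DOIs are assumptions made for the
example.

```python
import csv
import io
import re

# Loose DOI pattern; good enough for a sketch, not a full validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(text):
    """Return the first DOI found in text, lower-cased, or None."""
    match = DOI_PATTERN.search(text or "")
    return match.group(0).rstrip(".,;").lower() if match else None


def triage(csv_text, known_dois, wanted_affiliation):
    """Sort CSV rows into records to import, duplicates and rejects."""
    report = {"import": [], "duplicate": [], "rejected": []}
    for row in csv.DictReader(io.StringIO(csv_text)):
        title = row.get("title", "")
        doi = extract_doi(row.get("doi", ""))
        if wanted_affiliation.lower() not in row.get("affiliation", "").lower():
            report["rejected"].append((title, "wrong affiliation"))
        elif doi is None:
            report["rejected"].append((title, "no DOI"))
        elif doi in known_dois:
            report["duplicate"].append(doi)
        else:
            report["import"].append(doi)
    return report


# Hypothetical provider data and repository state.
csv_text = (
    "title,affiliation,doi\n"
    "Paper A,University of Zurich,https://doi.org/10.1000/ABC.1\n"
    "Paper B,ETH Zurich,10.1000/abc.2\n"
    'Paper C,"University of Zurich, Institute X",10.1000/abc.3\n'
    "Paper D,University of Zurich,\n"
)
report = triage(csv_text, known_dois={"10.1000/abc.3"},
                wanted_affiliation="University of Zurich")
for category, entries in report.items():
    print(category, entries)
```

In a real import the "import" list would be handed to the DOI import
plug-in and the report written to a file for the review team.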


In the case of the e-theses:
- getting a combined MARCXML/Adam XML file from the provider
- inserting a separate XML element per MARC record into the file that
groups a MARC record (M) with its associated ADAM records (A) - the file
carried the implicit assumption that the ADAM records immediately following
a MARC record belong to that MARC record. However, this relation is not
machine-parsable as such (there is no schema). So I went from a structure
like Root{M A A A M A M A A M A A A M A M A A A ...} to something like
Root{Doc(M A A A) Doc(M A) Doc(M A A) Doc(M A A A) Doc(M A) Doc(M A A A)
...}
- doing a tag analysis of both M and A using XSLT, then deciding on the
mapping to EPrints fields
- doing a content analysis of each tag using XSLT by grouping and sorting
the content alphabetically. This revealed the whole data nightmare:
inconsistent cataloging due to three different cataloging rulesets that
were applied over time, escaped words because of old cataloging rules for
indexing, missing data, typos, unusable additional phrases, inconsistent
cataloging of author names in different fields (in 100_a: family, given; in
245_c: given family, the latter being impossible to parse correctly because
of composed family names), and surprises such as a thesis being authored by
several authors while only the first author is recorded in 100_a
- 1 script, 1 import plug-in (AlephMarc), 1 config file for mapping MARC
--> eprint metadata
- extracting the metadata and data massaging
- downloading the PDF of the full-text Adam record and attaching it to the
eprint, setting language, format, content, embargo and security, and making
thumbnails on the fly
- downloading the PDF of the Adam record for the abstract, doing a
pdftotext conversion, extracting the abstract and removing title and author
information from the abstract
- doing a pdftotext conversion of the full-text's cover page and trying to
guess the faculty (which is often not available in the metadata), which is
a required field in the UZH repo
- flagging problems in a special eprints field for the review team
- printing a report of the import (successes and failures, detected
duplicates)
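
The regrouping step from Root{M A A A M A ...} to Root{Doc(M A A A)
Doc(M A) ...} could look roughly like this in Python with the standard
xml.etree.ElementTree module. The element names (root, marc, adam, doc)
are placeholders chosen for the sketch; the real file uses other names.

```python
import xml.etree.ElementTree as ET


def group_records(root, marc_tag="marc", adam_tag="adam"):
    """Wrap each MARC record and the ADAM records that follow it in a <doc>."""
    grouped = ET.Element(root.tag)
    current = None
    for child in root:
        if child.tag == marc_tag:
            # A MARC record starts a new group.
            current = ET.SubElement(grouped, "doc")
            current.append(child)
        elif child.tag == adam_tag:
            if current is None:
                raise ValueError("ADAM record without a preceding MARC record")
            current.append(child)
        else:
            raise ValueError(f"unexpected element <{child.tag}>")
    return grouped


# Hypothetical flat input following the pattern M A A A M A M A A.
flat = ET.fromstring(
    "<root><marc id='1'/><adam/><adam/><adam/>"
    "<marc id='2'/><adam/><marc id='3'/><adam/><adam/></root>"
)
grouped = group_records(flat)
shape = [[element.tag for element in doc] for doc in grouped]
print(shape)
```

Raising an error on an orphaned ADAM record or an unknown element makes a
violation of the implicit ordering assumption visible instead of silently
attaching records to the wrong thesis.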


Best regards,

Martin

--
Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Stampfenbachstr. 73
CH-8006 Zürich

mail: martin.braendle at id.uzh.ch
phone: +41 44 63 56705
fax: +41 44 63 54505
http://www.zi.uzh.ch



From:	"James Kerwin via Eprints-tech" <eprints-tech at ecs.soton.ac.uk>
To:	<eprints-tech at ecs.soton.ac.uk>
Date:	17.01.2019 11:21
Subject:	[EP-tech] Thesis Bulk Upload/Import
Sent by:	eprints-tech-bounces at ecs.soton.ac.uk



Hi All,

The University I work at is currently exploring options for digitising our
collection of theses, with an aim of them going into the institutional
repository and I have some questions if anybody could lend me some of their
experience and opinions.

1) I've noticed some organisations have a separate instance of EPrints for
theses. We currently put each thesis into the institutional repository
along with all other types of item. Is there a benefit to separating them
out?

2) Does EPrints facilitate any sort of bulk upload of Documents and EPrint
record creation? I've had a quick look around and found the following from
Tomasz Neugebauer and Bin Han:

https://www.researchgate.net/publication/291251891_Batch_Ingesting_into_EPrints_Digital_Repository_Software

I'm curious to see if this is still relevant (it's very thorough) or if
there are any other methods or potential pitfalls to avoid.

3) Following on from Q2, is there a preferred/ideal format of metadata? The
article makes it clear that many different formats are supported, but again
I'm wondering if there are any pros and cons to any particular format.

The digitising won't be complete for some time so I'm taking the
opportunity to get ahead of it and be ready.

Thanks,
James
*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/
