[EP-tech] OAI Harvesting
CAUTION: This e-mail originated outside the University of Southampton.
Morning!
Apologies for the slow reply. Firstly, Andy and John, thank you so much for your advice. It was really helpful and we DID manage to get a harvest with our RT2! As far as success goes, that was all that could be hoped for. Immediately after that, some more pressing work came up with our Data Repository, which now needs to be upgraded urgently.
Andy; I am going to have to return to this and fully wrap my head around it. It seems clear to me that some attention to our OAI config is needed as it appears to be quite messy and this advice is going to be really helpful with that.
John; I had a look in the logs to see if anyone was harvesting using "Department" and realised that only our theses use that field. I could find no clear evidence of it being used for harvesting, so I've removed it on Test and will probably do the same on Live when the time comes. I suppose the true test will be when somebody contacts me saying they're unable to harvest our content, at which point I'll gently point them towards "Type".
Thank you again for the help!
James
On Thu, Jan 20, 2022 at 2:30 PM John Salter <J.Salter at leeds.ac.uk> wrote:
Hi James,
That's an 'interesting' set setup. The default (commented-out) offering for that set doesn't have the department.
At a guess, it might have been added to create some disambiguation between authors of the same name, but in different departments - but that makes no sense, as it's using their IDs, not names.
To answer your question - it looks like a data-quality issue.
The following are *not* the same thing:
setName Person = Molecular and Clinical Pharmacology
setSpec 706572736F6E3D4D6F6C6563756C617220616E6420436C696E6963616C20506861726D61636F6C6F6779
setName Person = Molecular and clinical pharmacology
setSpec 706572736F6E3D4D6F6C6563756C617220616E6420636C696E6963616C20706861726D61636F6C6F6779
setName Person = Department of Molecular and Clinical Pharmacology
setSpec 706572736F6E3D4465706172746D656E74206F66204D6F6C6563756C617220616E6420436C696E6963616C20506861726D61636F6C6F6779
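NB the 'setSpec' is just the set id and value (e.g. "person=Molecular and Clinical Pharmacology") with each character written as its hexadecimal code, so names that differ only in capitalisation or wording produce different setSpecs.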
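The three setSpec values above can be decoded back to text to see exactly where they differ. A minimal sketch in Python; the function name decode_set_spec is only illustrative, and it assumes the hex represents UTF-8 text:

```python
def decode_set_spec(set_spec: str) -> str:
    """Turn a hex-encoded setSpec back into the text it represents."""
    # Assumption: the encoded text is UTF-8 (plain ASCII in these examples).
    return bytes.fromhex(set_spec).decode("utf-8")


# The three setSpec values quoted in the message above.
SET_SPECS = [
    "706572736F6E3D4D6F6C6563756C617220616E6420436C696E6963616C20506861726D61636F6C6F6779",
    "706572736F6E3D4D6F6C6563756C617220616E6420636C696E6963616C20706861726D61636F6C6F6779",
    "706572736F6E3D4465706172746D656E74206F66204D6F6C6563756C617220616E6420436C696E6963616C20506861726D61636F6C6F6779",
]

if __name__ == "__main__":
    for spec in SET_SPECS:
        print(decode_set_spec(spec))
```

Each value decodes to "person=" followed by the set name, so the first two differ only in the capitalisation of "Clinical Pharmacology".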
My guidance would be:
- feed your weblogs through a tool to analyse the OAI-PMH requests, and see who's using what. If no one is using the 'person' sets, I think removing their definitions would speed your OAI-PMH interface up. I guess they were added for a reason at some point though - hopefully someone somewhere will know something about them!
- (possibly - based on the above) remove the 'Department' from that set definition.
- add another set for 'divisions' based on the 'divisions' field you are using
- on your test server add some sets for testing (see Andy's email) - this is a very useful approach for testing RT2
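For the first point, a rough way to see which sets are actually requested is to count the "set" parameter in OAI-PMH requests in the web server log. This is only a sketch: it assumes an Apache-style access log with the request line in quotes and the OAI-PMH endpoint at /cgi/oai2, and the function name count_oai_sets is illustrative. POST requests and follow-up requests that carry only a resumptionToken have no "set" in the URL, so they are not counted.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Assumption: the request appears in the log line as "GET /path?query HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def count_oai_sets(log_lines, oai_path="/cgi/oai2"):
    """Count how often each setSpec is requested via the OAI-PMH endpoint."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if url.path != oai_path:
            continue  # not an OAI-PMH request
        # Only requests that name a set in the query string are counted.
        for set_spec in parse_qs(url.query).get("set", []):
            counts[set_spec] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_sets.py < access.log
    for set_spec, hits in count_oai_sets(sys.stdin).most_common():
        print(hits, set_spec)
```

Hex-encoded setSpecs in the output can be decoded to see which field and value they refer to.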
Cheers,
John
From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Andy Reid via Eprints-tech
Sent: 20 January 2022 13:21
To: eprints-tech at ecs.soton.ac.uk; James Kerwin <jkerwin2101 at gmail.com>
Subject: Re: [EP-tech] OAI Harvesting
Hi James,
When I was setting up RT2, I ignored the predefined sets in Elements, and created custom sets for testing and for production. I set up a cfg.d/zzz_symplectic_oai.pl, and split the production harvest into full-text-public, full-text-restricted, and full-text-none (metadata-only). I forget the thinking behind that split, but it does cover everything, I believe.
I'm not sure if $c->{oai}->{custom_sets} is something that is set up and parsed by default, or if you might need to enable that first. It was there, and I could edit it, so I did.
############################## PRODUCTION SETS ####################################################
#
# These are used in earnest by Symplectic Repository Tools 2
#
####################################################################################################
push @{$c->{oai}->{custom_sets}}, { spec => "full_text_none", name => "full_text_none", filters => [
{ meta_fields => [ "full_text_status" ], value=>"none", match=>"IN", merge=>"ANY" },
{ meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" }, # live records only, not in review or deleted
] };
push @{$c->{oai}->{custom_sets}}, { spec => "full_text_public", name => "full_text_public", filters => [
{ meta_fields => [ "full_text_status" ], value=>"public", match=>"IN", merge=>"ANY" },
{ meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" },
] };
push @{$c->{oai}->{custom_sets}}, { spec => "full_text_restricted", name => "full_text_restricted", filters => [
{ meta_fields => [ "full_text_status" ], value=>"restricted", match=>"IN", merge=>"ANY" },
{ meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" },
] };
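To check what one of these sets returns before pointing a harvester at it, a standard OAI-PMH ListIdentifiers request can be built for it. A small sketch in Python; the base URL is a placeholder, and it assumes the "spec" string is exposed unchanged as the setSpec (the ListSets output will confirm what the repository actually advertises):

```python
from urllib.parse import urlencode


def list_identifiers_url(base_url: str, set_spec: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListIdentifiers request URL for a single set."""
    query = urlencode(
        {
            "verb": "ListIdentifiers",
            "metadataPrefix": metadata_prefix,
            "set": set_spec,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Placeholder base URL: substitute your own repository's OAI-PMH endpoint.
    base = "https://repository.example.ac.uk/cgi/oai2"
    for spec in ("full_text_none", "full_text_public", "full_text_restricted"):
        print(list_identifiers_url(base, spec))
```

Opening the printed URLs in a browser shows the record identifiers in each set.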
For testing I had a variety of scratch sets, using named users, years, or lists of Eprint IDs:
e.g.
NAMED USER:
push @{$c->{oai}->{custom_sets}}, { spec => "symplectic_andy_email", name => "symplectic_andy_email", filters => [
{ meta_fields => [ "creators_id" ], value=>"andy REID lshtm", match=>"IN", merge=>"ALL" },
] };
SPECIFIC RECORDS:
push @{$c->{oai}->{custom_sets}}, { spec => "symplectic_test", name => "symplectic_test", filters => [
{ meta_fields => [ "eprintid" ], value=>"
4645869
4645797
4645491
4645719
4645785
4363558
4398757
4433720
3451639
2783042
19260
1924927
333704
3172489
3174428
1878135
4646586
4645489
4647623
4647670
",
match=>"IN",
merge=>"ANY" },
] };
#4645869 = article, OA, 2017
#4645797 = conference item, 2017
#4645491 = thesis, 2017
#4645719 = monograph
#4645458 = other, OA guide , library
#4363558 = book section [now recoded to article]
#4398757 = [Accepted manuscript] of 4363558
#3451639 = podcast
#2783042 = video
#2869451 = dataset
#19260 = patent
#1924927 = image
#333704 = artefact
#4646586 = exhibition
#4645489 = Teaching Resource (http://researchonline.lshtm.ac.uk/4645489/)
#3172489 = [Accepted Manuscript]
#3174428 = Final version of above
#1878135/ = [Inc; Grosskurth, H;] Manually added author
MULTIPLE FILTERS:
push @{$c->{oai}->{custom_sets}}, { spec => "full_text_public_live_patel2016", name => "full_text_public_live_patel2016", filters => [
{ meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" },
{ meta_fields => [ "full_text_status" ], value=>"public", match=>"IN", merge=>"ANY" },
{ meta_fields => [ "view_date" ], value=>"2016", match=>"IN", merge=>"ANY" },
{ meta_fields => [ "creators_id" ], value=>"vikram patel lshtm", match=>"IN", merge=>"ALL" }, # matches Vikram.patel at lshtm.ac.uk
] };
Hope that is useful
Andy
From: <eprints-tech-bounces at ecs.soton.ac.uk> on behalf of James Kerwin via Eprints-tech <eprints-tech at ecs.soton.ac.uk>
Reply to: "eprints-tech at ecs.soton.ac.uk" <eprints-tech at ecs.soton.ac.uk>, James Kerwin <jkerwin2101 at gmail.com>
Date: Thursday, 20 January 2022 at 12:49
To: "eprints-tech at ecs.soton.ac.uk" <eprints-tech at ecs.soton.ac.uk>
Subject: [EP-tech] OAI Harvesting
*** This message originated outside LSHTM ***
________________________________
Hi All,
We're setting up RT2 (Elements) at the moment and working through some bugs. This is not a specific EPrints problem, but I'm hoping the collective wisdom of those here can provide some clarity...
In our OAI ListSets pages it has become apparent that we have duplicate sets. We appear to have a peculiar setup whereby we have:
$oai->{sets} = [
{ id=>"person", allow_null=>0, fields=>"contributors_id/editors_id/department" }
This puts department in the person set. We don't even use department in our current EPrints records (we have Divisions which I've spoken about a LOT previously). What I'm curious about is:
1) How do duplicate sets come about? I thought the idea of a set would be if items have the same value they would be in the same set.
2) Is there any easy way to identify the duplicate sets? Somebody from Symplectic that I'm working with was kind enough to point them out on our live repository and sure enough if I ctrl+f for "Molecular and Clinical Pharmacology" on https://livrepository.liverpool.ac.uk/cgi/oai2?verb=ListSets it appears twice.
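One possible way to approach question 2 is to parse the ListSets response and group set names that are identical once case and spacing are ignored. The sketch below is in Python using only the standard library; the function name find_near_duplicate_sets is illustrative, and it looks at a single ListSets response only, so a list split across resumptionToken pages would need each page fed in.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def find_near_duplicate_sets(list_sets_xml):
    """Group sets whose names match when case and extra whitespace are ignored.

    Returns {normalised name: [(setName, setSpec), ...]} for groups with
    more than one distinct entry.
    """
    root = ET.fromstring(list_sets_xml)
    groups = defaultdict(set)
    for set_el in root.iterfind(".//oai:set", OAI_NS):
        name = set_el.findtext("oai:setName", default="", namespaces=OAI_NS)
        spec = set_el.findtext("oai:setSpec", default="", namespaces=OAI_NS)
        # Normalise: collapse runs of whitespace and ignore case.
        key = " ".join(name.split()).casefold()
        groups[key].add((name, spec))
    return {key: sorted(entries) for key, entries in groups.items() if len(entries) > 1}


if __name__ == "__main__":
    import sys

    # Usage: save the ListSets response to a file, then
    #   python find_duplicates.py listsets.xml
    with open(sys.argv[1], "rb") as handle:
        for key, entries in find_near_duplicate_sets(handle.read()).items():
            print(key)
            for name, spec in entries:
                print("   ", name, spec)
```

This only catches names that differ in case or spacing; variants such as an added "Department of" prefix would still need to be spotted by eye.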
I've tried to learn about OAI, but it does unfortunately make my brain scream because I just do not understand it properly.
Thanks,
James