[EP-tech] OAI Harvesting

Andy Reid Andy.REID at lshtm.ac.uk
Thu Jan 20 13:20:38 GMT 2022


Hi James,
When I was setting up RT2, I ignored the predefined sets in Elements, and created custom sets for testing and for production. I set up a cfg.d/zzz_symplectic_oai.pl, and split the production harvest into full-text-public, full-text-restricted, and full-text-none (metadata-only). I forget the thinking behind that split, but it does cover everything, I believe.

I’m not sure if $c->{oai}->{custom_sets} is something that is set up and parsed by default, or if you might need to enable that first. It was there, and I could edit it, so I did.
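
On the Perl side at least, the push cannot fail on a missing key: push @{ ... } creates the array (and any missing parent hash) on first use. Whether EPrints then reads the key depends on the EPrints version and on no later config file overwriting it; the zzz_ prefix presumably helps there, assuming cfg.d files are loaded in alphabetical order. A minimal standalone sketch, with a plain hash standing in for the real $c and a made-up set name:

```perl
use strict;
use warnings;

# Stand-in for the repository config; inside a real cfg.d file EPrints supplies $c.
my $c = {};

# Neither {oai} nor {custom_sets} exists yet. push autovivifies both levels.
# "demo_set" is a hypothetical spec used only for this illustration.
push @{ $c->{oai}->{custom_sets} }, { spec => "demo_set", name => "demo_set", filters => [] };

printf "custom_sets now holds %d set(s)\n", scalar @{ $c->{oai}->{custom_sets} };
```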

##############################  PRODUCTION SETS ####################################################
#
#  These are used in earnest by Symplectic Repository Tools 2
#
####################################################################################################


push @{$c->{oai}->{custom_sets}}, { spec => "full_text_none", name => "full_text_none", filters => [

                { meta_fields => [ "full_text_status" ], value=>"none", match=>"IN", merge=>"ANY" },
                  { meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" },  # live records only, not in review or deleted

] };



push @{$c->{oai}->{custom_sets}}, { spec => "full_text_public", name => "full_text_public", filters => [

                { meta_fields => [ "full_text_status" ], value=>"public", match=>"IN", merge=>"ANY" },
                  { meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" },


] };
push @{$c->{oai}->{custom_sets}}, { spec => "full_text_restricted", name => "full_text_restricted", filters => [

                { meta_fields => [ "full_text_status" ], value=>"restricted", match=>"IN", merge=>"ANY" },
                  { meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" },

] };
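
Since the three production sets differ only in the full_text_status value, they could also be generated in a loop. This is a sketch under the assumption that the filter keys behave exactly as in the hand-written versions above; the "my $c" line is there only so the snippet runs standalone and would be dropped inside cfg.d.

```perl
use strict;
use warnings;

my $c = {};    # stand-in; EPrints supplies $c inside cfg.d

# One custom set per full-text status, each restricted to live records.
for my $status (qw( none public restricted )) {
    push @{ $c->{oai}->{custom_sets} }, {
        spec    => "full_text_$status",
        name    => "full_text_$status",
        filters => [
            { meta_fields => [ "full_text_status" ], value => $status,   match => "IN", merge => "ANY" },
            { meta_fields => [ "eprint_status" ],    value => "archive", match => "IN", merge => "ANY" },
        ],
    };
}

print "$_->{spec}\n" for @{ $c->{oai}->{custom_sets} };
```

A harvester would then ask for one set at a time using the standard OAI-PMH set argument, for example cgi/oai2?verb=ListRecords&metadataPrefix=oai_dc&set=full_text_public on the repository's OAI endpoint.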


For testing I had a variety of scratch sets, using named users, years, or lists of Eprint IDs:

e.g.

NAMED USER:

push @{$c->{oai}->{custom_sets}}, { spec => "symplectic_andy_email", name => "symplectic_andy_email", filters => [

                { meta_fields => [ "creators_id" ], value=>"andy REID lshtm", match=>"IN", merge=>"ALL" },

] };

SPECIFIC RECORDS:
push @{$c->{oai}->{custom_sets}}, { spec => "symplectic_test", name => "symplectic_test", filters => [

                { meta_fields => [ "eprintid" ], value=>"
                4645869
                4645797
                4645491
                4645719
                4645785
                4363558
                4398757
                4433720
                3451639
                2783042
                19260
                1924927
                333704
                3172489
                3174428
                1878135
                4646586
                4645489
                4647623
                4647670

                ",
                match=>"IN",
                merge=>"ANY" },

] };

#4645869 = article, OA, 2017
#4645797 = conference item, 2017
#4645491 = thesis, 2017
#4645719 = monograph
#4645458 = other, OA guide , library
#4363558 = book section [now recoded to article]
#4398757 = [Accepted manuscript] of 4363558
#3451639 = podcast
#2783042 = video
#2869451 = dataset
#19260 = patent
#1924927 = image
#333704 = artefact
# 4646586  exhibition
#http://researchonline.lshtm.ac.uk/4645489/  Teaching Resource

#3172489 = [Accepted Manuscript]
#3174428 = Final version of above
#1878135/ = [Inc; Grosskurth, H;]  Manually added author
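
One way to keep each test ID next to its annotation is to hold the IDs in a Perl array and join them into the filter value. A sketch only: it assumes the value is split on whitespace, as the multi-line string above implies, and it lists just a handful of the IDs whose annotations appear above. As before, "my $c" is a stand-in so the snippet runs on its own.

```perl
use strict;
use warnings;

my $c = {};    # stand-in; EPrints supplies $c inside cfg.d

# Each ID carries its own note, so the list and the notes cannot drift apart.
my @test_ids = (
    4645869,    # article, OA, 2017
    4645797,    # conference item, 2017
    4645491,    # thesis, 2017
    4645719,    # monograph
    3451639,    # podcast
    2783042,    # video
);

push @{ $c->{oai}->{custom_sets} }, { spec => "symplectic_test", name => "symplectic_test", filters => [
    # Assumption: a space-separated list is treated the same as the newline-separated one.
    { meta_fields => [ "eprintid" ], value => join( " ", @test_ids ), match => "IN", merge => "ANY" },
] };

print $c->{oai}->{custom_sets}->[0]->{filters}->[0]->{value}, "\n";
```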


MULTIPLE FILTERS:

push @{$c->{oai}->{custom_sets}}, { spec => "full_text_public_live_patel2016", name => "full_text_public_live_patel2016", filters => [

                 { meta_fields => [ "eprint_status" ], value=>"archive", match=>"IN", merge=>"ANY" },
                { meta_fields => [ "full_text_status" ], value=>"public", match=>"IN", merge=>"ANY" },
                 { meta_fields => [ "view_date" ], value=>"2016", match=>"IN", merge=>"ANY" },
                { meta_fields => [ "creators_id" ], value=>"vikram patel lshtm", match=>"IN", merge=>"ALL" },   # matches Vikram.patel at lshtm.ac.uk

] };


Hope that is useful

Andy

From: <eprints-tech-bounces at ecs.soton.ac.uk> on behalf of James Kerwin via Eprints-tech <eprints-tech at ecs.soton.ac.uk>
Reply to: "eprints-tech at ecs.soton.ac.uk" <eprints-tech at ecs.soton.ac.uk>, James Kerwin <jkerwin2101 at gmail.com>
Date: Thursday, 20 January 2022 at 12:49
To: "eprints-tech at ecs.soton.ac.uk" <eprints-tech at ecs.soton.ac.uk>
Subject: [EP-tech] OAI Harvesting

Hi All,

We're setting up RT2 (Elements) at the moment and working through some bugs. This is not a specific EPrints problem, but I'm hoping the collective wisdom of those here can provide some clarity...

In our OAI ListSets pages it has become apparent that we have duplicate sets. We appear to have a peculiar setup whereby we have:

$oai->{sets} = [
{ id=>"person", allow_null=>0, fields=>"contributors_id/editors_id/department" }

This puts department in the person set. We don't even use department in our current EPrints records (we have Divisions which I've spoken about a LOT previously). What I'm curious about is:

1) How do duplicate sets come about? I thought the idea of a set would be if items have the same value they would be in the same set.

2) Is there any easy way to identify the duplicate sets? Somebody from Symplectic that I'm working with was kind enough to point them out on our live repository and sure enough if I ctrl+f for "Molecular and Clinical Pharmacology" on https://livrepository.liverpool.ac.uk/cgi/oai2?verb=ListSets it appears twice.
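
For question 2, one possible approach is to save the ListSets response and group the setSpec values by setName, reporting any name that is used by more than one set. The sketch below is not an EPrints tool: the function name, the sample XML, and the example_spec_* values are all made up for illustration, and the regex extraction is a shortcut that a proper XML parser would replace.

```perl
use strict;
use warnings;

# Return a hash ref of setName => [setSpecs] for names shared by more than one set.
sub duplicate_set_names {
    my ($xml) = @_;
    my %specs_for;

    # Walk each <set> element and pull out its spec and name.
    while ( $xml =~ m{<set>(.*?)</set>}gs ) {
        my $block = $1;
        my ($spec) = $block =~ m{<setSpec>(.*?)</setSpec>}s;
        my ($name) = $block =~ m{<setName>(.*?)</setName>}s;
        next unless defined $spec && defined $name;
        push @{ $specs_for{$name} }, $spec;
    }

    # Keep only the names that occur under two or more specs.
    my %dupes;
    for my $name ( keys %specs_for ) {
        $dupes{$name} = $specs_for{$name} if @{ $specs_for{$name} } > 1;
    }
    return \%dupes;
}

# Hypothetical sample standing in for a saved ListSets response.
my $sample = <<'XML';
<ListSets>
  <set><setSpec>example_spec_1</setSpec><setName>Molecular and Clinical Pharmacology</setName></set>
  <set><setSpec>example_spec_2</setSpec><setName>Molecular and Clinical Pharmacology</setName></set>
  <set><setSpec>example_spec_3</setSpec><setName>Another Department</setName></set>
</ListSets>
XML

my $dupes = duplicate_set_names($sample);
for my $name ( sort keys %$dupes ) {
    print "$name: ", join( ", ", @{ $dupes->{$name} } ), "\n";
}
```

Seeing which setSpec values share a name may also hint at which of the configured fields each duplicate comes from. If the repository pages its ListSets output with a resumptionToken, each page would need to be fetched and fed through the same function.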

I've tried to learn about OAI, but it does unfortunately make my brain scream because I just do not understand it properly.

Thanks,
James