[GOAL] Re: PMC & UKPMC Should Harvest From Institutional Repositories

Stevan Harnad amsciforum at gmail.com
Fri Apr 13 23:33:44 BST 2012


Johanna McEntyre of EBI has raised an important series of questions
about institutional deposit and institution-external harvesting (by
PMC and UKPMC) versus direct institution-external deposit (in PMC and
UKPMC), so I have replied in quote/commentary format:

On Fri, Apr 13, 2012 at 11:27 AM, Johanna McEntyre <mcentyre at ebi.ac.uk> wrote:

> Stevan,
>
> Thanks for these comments on how PMC & UKPMC could be improved. While I can't respond to the mandate changes suggested, I can comment on the suggestion that UKPMC should harvest/link to IR versions of papers.
>
> We have considered doing this in some depth.  However, for a number of reasons this is not as straightforward to actually do as it is to say:
>
> (1) Firstly, UKPMC is a full text article database. Harvesting protocols such as OAI-PMH deal in metadata only. UKPMC is already supplemented by PubMed, Agricola, and EPO patent abstracts (about 26 million of them), so it is unclear how much content routine harvesting would add.


It will (i) add to UKPMC all UK biomedical research output that is
currently being self-archived -- spontaneously or mandatorily -- in
its authors' respective institutional repositories but is not
mandated for UKPMC deposit.

Much more important, it will (ii) greatly facilitate and strengthen
the adoption of self-archiving mandates by the rest of the UK's
institutions, thereby (iii) generating much more UK OA content (in all
disciplines) -- including much more UK biomedical output for
harvesting into UKPMC.

> (2) Secondly, there is no clean way to identify life science & related content in IRs (this is a matter of research not production-level functionality), apart from perhaps resolving metadata to PMIDs, which then of course would not add new content to UKPMC.


If UKPMC harvested from IRs (and, even more important, if the funders
that now mandate direct deposit in UKPMC instead mandated deposit in
IRs, for harvesting by UKPMC), the software for identifying UK
biomedical output would rapidly (and happily) be developed.

The lack of identifying software is not the problem: the lack of
institutional self-archiving mandates is; and funders insisting on
UKPMC instead of IR deposit and UKPMC harvest compounds the problem
instead of contributing to its solution.

> (3) Thirdly, because UKPMC is primarily interested in full text articles, we would want to identify those records in IRs that have full text. Again, there is no clean programmatic way of doing this that we know of. If anyone knows how to do this programmatically then we would be interested in learning how.


This too is a problem that IR software can easily solve -- if given
the incentive of (a) IR deposit mandates and (b) UKPMC harvesting
capability.

> (4) Finally, PMC & UKPMC (and PMC Canada) archive full text articles in XML. This structured content facilitates:
>
> (a) linking to related public life science databases such as UniProt;
> (b) operations such as text mining and smart indexing (e.g. restricting searches to figure legends);
> (c) ensures the integrity of the archive since viewed articles are rendered from the XML database to HTML on the fly, and
> (d) reuse by third parties, in the case of OA articles.


That's all fine, for the OA content already being deposited in UKPMC (+ PMC).

But that is only a small fraction of total biomedical (or UK
biomedical) output, all of which is provided by institutions.

Surely additional OA content, even if less optimally tagged, is
preferable to less OA content, optimally tagged. That will also
provide the incentive to upgrade the tagging of the extra IR content
to XML -- and eventually IRs will graduate to XML too: but first
things first. And the overwhelming priority is not XML but OA itself!

> Therefore, in the event that we could identify life science full text articles in IRs, we would want to add the ones we don't already have to UKPMC, not just link to them. For those articles, there is a lack of clarity regarding licensing information. Establishing the license of a given article currently requires a manual process and therefore is not at all scalable or sustainable. The only way around this that I can envision is for licensing information to be represented formally in structured data, with the best enabling licenses for content exchange being CC-BY or CC0.


Same reply about licensing as about XML tagging, above: Surely
additional OA content, even if less optimally licensed, is preferable
to less OA content, optimally licensed.

> If we harvest full-text content into UKPMC - which we do not have the right to harvest - we know from experience that this would be subject to a take-down request. Harvesting content, converting it to XML, and then being asked to remove it from the repository is not a strategy we wish to follow.


That provides yet another good reason for just harvesting the metadata
and URL for the time being. It will facilitate the generation of much
more OA, for the reasons mentioned, and eventually will lead to
optimal tagging and licensing too.
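
As a rough illustration of what harvesting only the metadata and URL
could look like, here is a minimal Python sketch. It assumes the IR
exposes its records through OAI-PMH in the oai_dc (Dublin Core) format.
The sample response, the repository.example.org addresses and the
extract_links helper are all hypothetical; none of this is taken from
PMC, UKPMC or any particular IR.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and its Dublin Core metadata format.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical, hand-written ListRecords response (assumption: a real IR
# would return something of this general shape, with many more fields).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1234</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A hypothetical biomedical paper</dc:title>
          <dc:identifier>http://repository.example.org/1234/</dc:identifier>
          <dc:identifier>doi:10.0000/example</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def extract_links(xml_text):
    """Return one dict per record: OAI identifier, title, and any URLs."""
    root = ET.fromstring(xml_text.strip())
    results = []
    for record in root.findall(".//oai:record", NS):
        oai_id = record.findtext("oai:header/oai:identifier",
                                 default="", namespaces=NS)
        title = record.findtext(".//dc:title", default="", namespaces=NS)
        # dc:identifier can hold URLs, DOIs, etc.; keep only the URLs.
        urls = [
            el.text.strip()
            for el in record.findall(".//dc:identifier", NS)
            if el.text and el.text.strip().startswith(("http://", "https://"))
        ]
        results.append({"oai_id": oai_id, "title": title, "urls": urls})
    return results


if __name__ == "__main__":
    for rec in extract_links(SAMPLE_RESPONSE):
        print(rec["oai_id"], rec["title"], rec["urls"])
```

A harvester built along these lines would store only the title and the
link back to the IR copy, so nothing is copied into the central
repository and there is nothing to take down.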

> Content exchange to maximize usage in different contexts need not be a one-way process. Another option to consider is to encourage authors to deposit centrally (so we can do the things listed above) and then push content from UKPMC to populate IRs, for the purpose of institutional reporting, for example. We have an FTP site of OA articles: http://ukpmc.ac.uk/ftp/oa (there are over 400,000 OA articles there now) and will soon be releasing a web service that will retrieve metadata and full text (in the case of OA articles).


There are perhaps 3-4 major discipline-based central repositories of
any nontrivial size (mainly Arxiv in physics, PMC/UKPMC in biomedicine
and SSRN in social sciences). In contrast, there are at least 10,000
research-active institutions generating all of the planet's research
output in at least 40 STM and humanities disciplines.

Do you really think that a realistic and natural way to make the
research output of all those institutions and disciplines OA is to
wait for it to be spontaneously deposited in an institution-external
repository, and then back-harvest it to the institution from which
it originated?

What is needed is institutional self-archiving mandates, for all
research, funded and unfunded. Funder mandates that require
institution-external deposits, and institution-external repositories
that require direct deposit instead of harvesting, are needlessly
creating impediments to the adoption and implementation of OA mandates
by the universal providers of all research, funded and unfunded: the
planet's universities and research institutes.

> I'd also like to add that we are actively exploring how UKPMC can integrate with IRs, in particular with respect to related data resources via the EBI's partnership in the OpenAIRE Plus project. We will be continuing to collaborate to explore how IRs and UKPMC can interoperate better.

The returns from integrating with the sparse contents of IRs (most of
them unmandated, hence near empty) are a far cry from what they could
be if PMC and UKPMC (and funder mandates!) took the simple step of
harvesting from IRs instead of requiring direct institution-external
deposit.

Stevan Harnad
>
> Jo McEntyre
>
>
> On Apr 12, 2012, at 12:05 PM, Stevan Harnad wrote:
>
> > On 2012-04-12, at 5:44 AM, Steve Hitchcock wrote:
> >
> >> Do we know why Pubmed does not apparently link to papers in IRs?
> >> Is this Pubmed policy, or is there a technical reason?
> >>
> >> Stephen Curry: PubMed, the first port of call for anyone searching
> >> the biomedical literature, frequently links to publisher’s site but
> >> never to institutional repositories
> >> http://occamstypewriter.org/scurry/2012/03/18/elsevier-the-research-works-act-and-open-access-where-to-now/
> >
> > PubMed & PubMed Central are wonderful resources, but not nearly
> > as resourceful or wonderful as they easily could be.
> >
> > (1) PMC & UKPMC should of course be harvesting or linking to
> > institutional repository (IR) versions of papers, not just
> > PMC/UKPMC-deposited and publisher-hosted papers.
> >
> > (2) Funders should be mandating IR deposit and PMC harvesting
> > rather than direct PMC deposit. By thus making funder mandates
> > and institutional mandates convergent and collaborative instead
> > of divergent and competitive, this will motivate and facilitate adoption of
> > and compliance with institutional mandates: institutions are the universal
> > providers of all research output, funded and unfunded.
> >
> > (3) IRs should mandate immediate deposit irrespective of publisher
> > OA policy: If authors wish to honor publisher OA embargoes, they
> > can set access to the deposit as Closed Access during the embargo
> > and rely on providing almost-OA via the IR's email eprint request button.
> >
> > (4) Funder mandates should require deposit by the fundee -- the one
> > bound by the mandate -- rather than by the publisher, who is not
> > bound by the mandate, and indeed in conflict of interest with it.
> > http://openaccess.eprints.org/index.php?/archives/876-.html
> >
> > (5) Publishers (partly to protect from rival publisher free-loading,
> > partly to discourage funder mandates, and partly out of simple
> > misunderstanding of network capability) are much more likely
> > to endorse immediate institutional self-archiving than institution-external
> > deposit. This is yet another reason funders should mandate institutional
> > deposit and metadata harvesting instead of direct institution-external deposit.
> >
> > Stevan Harnad
> >
> >
> > _______________________________________________
> > GOAL mailing list
> > GOAL at eprints.org
> > http://mailman.ecs.soton.ac.uk/mailman/listinfo/goal
>
>


