[GOAL] Re: Paperity launched. The 1st multidisciplinary aggregator of OA journals & papers

Peter Murray-Rust pm286 at cam.ac.uk
Sun Oct 12 16:43:24 BST 2014


On Sun, Oct 12, 2014 at 1:44 PM, Jan Velterop <velterop at gmail.com> wrote:

>
> On 12 Oct 2014, at 12:51, Stevan Harnad <harnad at ecs.soton.ac.uk> wrote:
>
> Harvesting Gold OA journal articles is a piece of cake.
>
>
> Indeed. Not just for Paperity, but for anybody else. It's one of the
> attractions and benefits of open access via the 'gold' route.
>

Yes,

It's noteworthy that almost all modern text and data mining exercises are
carried out on the Open Access subset of the literature. In some cases this
is an attempt to get the whole Open literature - in others it's a subsubset
such as EuropePubMedCentral. (The alternatives to this are (a) to ignore
rights and mine anyway - something we are legally allowed to do in the UK
but almost nowhere else, or (b) to do it in private, hoping you won't be found
out and too scared to publish your sources as a good scholar should).

> Another is that most articles can be harvested in XML-format, which enables
> sophisticated and worthwhile services to be added to aggregations.
>

This is true for born-Open publishers such as BioMedCentral, PLOS*, eLife,
PeerJ, Ubiquity ... This is a straightforward sale - author payment =>
freedom for re-use. It works very well for text miners. (And please don't
tell us that mining is a minority sport which has to tread water for
another 5-10 years).

I have not systematically surveyed whether XML is offered in the "Gold"
Open Access journals of other major publishers nor whether the licence is
always permissive. Those people who argue that CC-NC-ND protects authors
(it doesn't) should realise that it has a massive negative impact on useful
re-use including mining.

Hybrid journals almost certainly do not offer XML. It's hard enough for
them to offer CC-BY for "Open Access".

It works less well for born-Closed publishers (such as Elsevier, NPG, ACS,
etc.). Rather than having the simple

> And aggregations enable researchers to conveniently make large-scale
> pattern- and meta-analyses without first having to gather all the material
> from different and disparate sources.
>

Yes - we have built the apparatus to do this in contentmine.org


> Few 'green' repositories that I'm aware of have XML-versions (correct me
> if I'm wrong – and should I be wrong, is there a list of such
> repositories?). Aggregations, by the way, cannot be made without clarity
> about rights and licences, since they are a form of re-use. Those rights
> are clear, and properly included in metadata, for proper 'gold', but often
> not for 'green' versions of paywalled articles in repositories.
>

Exactly. Most "Green" repositories make it very hard to re-use material.
This is primarily due to copyright - the default library approach is to say
"this may be copyright and you cannot use it unless you write to the author
and get permission in writing with real ink". Then there is the technology.
University repositories are constructed on the basis that each document is
a priceless artefact that scholars will spend hours discovering and
reading. The reality of science is that most of these documents will
probably only be read by machines. Some countries (NL, FR for example) at
least aggregate some documents - such as theses - and the UK has CORE to
try to remedy the situation, but even so it's extremely difficult to index
and search repositories.

I wrote to Bernard Rentier offering to index his repository for scientific
terms but was told - sadly - that there was a new phase of investment
required before this would be possible.

Another problem with most repositories is that they insist on transforming
DOCX or LaTeX into PDF. Even for their own theses. This is an act of
barbarism. PDF has no semantics and it destroys about 50-75% of the science
in the document.

Anyway we expect to announce our own Open indexing of the literature RSN.


-- 
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069