[BOAI] Unethical harvesters

Peter Suber peters at earlham.edu
Wed Oct 28 16:59:51 GMT 2009


[Forwarding from Arthur Sale.  --Peter Suber.]


I write to draw the list’s attention to unethical 
behaviour by a national harvester – the 
Australian Research Online gateway.  This 
gateway, operated by the National Library of 
Australia, has rejected the OAI-PMH standard and 
has announced a local variant. This sort of 
behaviour by harvesters must be firmly stamped on 
as soon as possible. International standards are 
to be complied with, not modified for budgetary convenience.

Responsibility

It is the responsibility of any OAI-PMH harvester 
such as ARO, ADT, ROAR, OpenDOAR, OAIster, etc., to 
harvest correctly from all OAI-PMH-compliant 
repositories that exist in the wild and which it 
regards as its target group. Please examine that 
sentence carefully: the responsibility is with a 
gateway (which ARO is) to harvest from any 
compliant OAI-PMH interface, and not to 
misrepresent the data. The National Library fails on both counts.
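
For readers less familiar with the protocol, the following is a minimal sketch of what "harvesting from any compliant interface" involves on the harvester's side: issue a ListRecords request, read the record identifiers from each OAI-PMH header, and follow the resumptionToken until the repository stops supplying one. The function names, the sample response and the repository.example.org host are all hypothetical, and the sketch parses a canned response rather than contacting a real repository.

```python
import urllib.parse
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_page(xml_text):
    """Return (header identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [h.findtext(OAI_NS + "identifier") for h in root.iter(OAI_NS + "header")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    text = (token_el.text or "").strip() if token_el is not None else ""
    # An absent or empty resumptionToken means the list is complete.
    return ids, (text or None)


# Canned response standing in for a real repository (assumption, not real data).
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:repository.example.org:1</identifier></header></record>
    <record><header><identifier>oai:repository.example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE_PAGE)
next_url = list_records_url("http://repository.example.org/oai", token)
```

Nothing here depends on how the repository chooses to fill in its Dublin Core elements, which is the point: the protocol itself already tells the harvester what each record is called.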

Remember that international standards such as 
OAI-PMH are designed to permit global interchange 
of metadata. Any harvester that insists on some 
individual or local restriction of the 
international standard is irresponsible. I did 
not expect this of the National Library of 
Australia. So far it seems to be globally unique in this behaviour.

Why does it fail? In a nutshell, possible hubris 
and probable laziness. As to hubris, the NLA has 
produced a set of requirements for harvesting with 
which it expects repositories to comply.  Requiring 
each repository to comply with its “requirements”, 
rather than the National Library of Australia (NLA) harvesting properly:

* multiplies the work as each Australian 
repository has to adapt its interface or opt-out 
(rather than the NLA doing the job properly once),

* introduces the chance of breaking an existing 
harvesting arrangement if the repository changes its interface, and

* would be absolutely fatal to the whole global 
enterprise if another harvester came up with incompatible requirements.

In the case of my university it would definitely 
break our in-house one-on-one harvesting for 
Government data reporting and would be likely to 
have similar flow-on effects for our national PhD 
thesis harvesting at the very least. If all 
harvesters were to come up with idiosyncratic 
requirements, the world would be in a real mess 
and harvesting, not to mention search engines, 
would be infeasible. Just imagine if Google were 
to behave the same way in the html world! At most 
these ARO “requirements” constitute a set of suggestions.

The probable laziness comes from programmers. It 
is trivially easy to do a proper harvest from all 
the repositories that exist in Australia (there 
are not that many, and even fewer software packages). I 
can think of at least two strategies, neither of 
which would take more than an hour of a competent 
programmer’s time. ADT and the rest of the 
world’s OAI harvesters can do it; why can’t the NLA?

“Best Practice”

I hesitated to write this section because some 
will think it is important. It isn’t. The main 
issue is the one above. However, it is bound to 
be raised by the NLA to justify their so-called 
“requirements”. This is the argument that their 
harvesting “requirements” are good practice. In 
fact it is not difficult to mount a case that the 
GNU EPrints scheme is better practice than the 
ARO scheme. Consider these quotes from the Dublin 
Core Initiative (the red is mine):


“4.14. Identifier

Label: Resource Identifier
Element Description: An unambiguous reference to 
the resource within a given context. Recommended 
best practice is to identify the resource by 
means of a string or number conforming to a 
formal identification system. Examples of formal 
identification systems include the Uniform 
Resource Identifier (URI) (including the Uniform 
Resource Locator (URL)), the Digital Object 
Identifier (DOI) and the International Standard Book Number (ISBN).
Guidelines for content creation:
This element can also be used for local 
identifiers (e.g. ID numbers or call numbers) 
assigned by the Creator of the resource to apply 
to a particular item. It should not be used for 
identification of the metadata record itself.”

[Using Dublin Core - The Elements, 
http://dublincore.org/documents/usageguide/elements.shtml]



“3. Element Content and Controlled Vocabularies

Each Dublin Core element is optional and 
repeatable, and there is no defined order of 
elements. The ordering of multiple occurrences of 
the same element (e.g., Creator) may have a 
significance intended by the provider, but 
ordering is not guaranteed to be preserved in every user environment.”

[Using Dublin Core, 
http://dublincore.org/documents/usageguide/]


The NLA “requirements” specify that the relevant 
metadata must be in a dc:identifier field, 
contrary to these guidelines. Further, ARO 
“require” that the first dc:identifier element be 
the metadata identifier, despite clear indications that order does not matter.
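
To make the point concrete, here is a minimal sketch of order-independent handling, under stated assumptions: the record identifier is taken from the OAI-PMH header, and the dc:identifier values are treated as an unordered collection that is sorted by kind rather than by position. This is only an illustration, not necessarily how any existing harvester works; describe_record, make_record and the eprints.example.edu values are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def describe_record(record_xml):
    """Summarise one OAI-PMH record without relying on dc:identifier order."""
    record = ET.fromstring(record_xml)
    # The header identifier names the metadata record itself.
    record_id = record.findtext(f"{OAI}header/{OAI}identifier")
    identifiers = [el.text.strip() for el in record.iter(DC + "identifier") if el.text]
    # Classify by form, never by position in the record.
    urls = [v for v in identifiers if v.lower().startswith(("http://", "https://"))]
    others = [v for v in identifiers if v not in urls]
    return {"record_id": record_id, "urls": urls, "other_identifiers": others}


def make_record(dc_identifiers):
    """Build a sample record (hypothetical data, for illustration only)."""
    body = "".join("<dc:identifier>" + v + "</dc:identifier>" for v in dc_identifiers)
    return (
        '<record xmlns="http://www.openarchives.org/OAI/2.0/">'
        "<header><identifier>oai:eprints.example.edu:42</identifier></header>"
        "<metadata>"
        '<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"'
        ' xmlns:dc="http://purl.org/dc/elements/1.1/">'
        + body +
        "</oai_dc:dc></metadata></record>"
    )


# The same two identifiers, supplied in opposite orders.
a = describe_record(make_record(["http://eprints.example.edu/42/", "doi:10.1234/example"]))
b = describe_record(make_record(["doi:10.1234/example", "http://eprints.example.edu/42/"]))
```

Both orderings yield the same summary, so a harvester written this way has no need to dictate which dc:identifier a repository lists first.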

Don’t get me wrong. I am not on a crusade to 
change the way repositories currently present 
their OAI-PMH elements, unlike ARO. I really 
don’t care much how they interpret the standards. 
But I do care about the NLA assuming such a 
bullying stance in relation to Australian 
repositories. Already at least two Australian 
repositories have confessed to changing their 
OAI-PMH interface to suit ARO! If this happens 
elsewhere, the consequences for open access are 
significant as incompatibilities are bound to arise.

Conclusions

1.  Readers of the list should be alert for 
similar unethical behaviour in their territories.
2.  ARO and the NLA should start harvesting from 
the Australian OAI-PMH interfaces correctly, as 
soon as possible, just as the rest of the world does.
3.  In the meantime, mis-harvested repositories 
should be withdrawn from the ARO gateway database.
4.  If ARO does not comply, Australian 
repositories will need to consider boycotting the service.

Arthur Sale
Emeritus Professor of Computer Science
University of Tasmania