<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; "><div><div>Important feedback from Tim Brody, one of the developers of EPrints:</div><div><br></div><div>Begin forwarded message:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>From: </b></span><span style="font-family:'Helvetica'; font-size:medium;">Tim Brody &lt;<a href="mailto:tdb2@ecs.soton.ac.uk">tdb2@ecs.soton.ac.uk</a>&gt;<br></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>Date: </b></span><span style="font-family:'Helvetica'; font-size:medium;">February 17, 2012 6:33:22 AM EST<br></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>To: </b></span><span style="font-family:'Helvetica'; font-size:medium;"><a href="mailto:eprints-tech@ecs.soton.ac.uk">eprints-tech@ecs.soton.ac.uk</a><br></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>Cc: </b></span><span style="font-family:'Helvetica'; font-size:medium;"><a href="mailto:JISC-REPOSITORIES@JISCMAIL.AC.UK">JISC-REPOSITORIES@JISCMAIL.AC.UK</a><br></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium; color:rgba(0, 0, 0, 1);"><b>Subject: </b></span><span style="font-family:'Helvetica'; font-size:medium;"><b>[EP-tech] Re: Google Scholar discoverability of repository content</b><br></span></div><div style="margin-top: 0px; 
margin-right: 0px; margin-bottom: 0px; margin-left: 0px;"><span style="font-family:'Helvetica'; font-size:medium;"><font class="Apple-style-span" color="#000000"><b>--------</b></font></span></div><br><div>Hi All,<br><br>Here is some specific advice for existing repository administrators from<br>Google Scholar:<br><a href="http://roar.eprints.org/help/google_scholar.html">http://roar.eprints.org/help/google_scholar.html</a><br><br>As far as I'm aware there isn't anyone running EPrints 2 now, so<br>EPrints-based repositories are already (and have been for a long time)<br>the "best in class" for Google Scholar.<br><br><br>Right, this paper ...<br><br>Table 1 is irrelevant and misleading. Scholar links first to the<br>publisher and, only if there is no publisher link, directly to the IR<br>version. That's a policy decision on the part of Scholar and nothing to<br>do with IRs.<br><br>Table 2 gives us some useful data. The headline rate for EPrints is 88%<br>(based on CalTech). Unfortunately the authors haven't provided an<br>analysis of what happened to the missing records. I've done a quick<br>random sample of CalTech and I suspect the missing records will consist<br>of:<br>1) Non-OA/non-full-text records (I'm sure a query to the CalTech<br>repository admin could supply the data).<br>2) A percentage of PDFs that Scholar won't be able to parse. CalTech<br>contains some old (1950s), scanned PDFs from journals. Where the article<br>isn't at the top of the page, Scholar will struggle to parse the<br>title/authors/abstract and therefore won't be able to match it to their<br>records, e.g. http://authors.library.caltech.edu/5815/<br><br><br>The remainder of the paper describes the authors' process of fixing<br>their own IR (based on CONTENTdm).<br><br><br>The authors then wrongly conclude:<br><br>"Despite GS’s endorsement of three software packages, the surveys<br>conducted for this paper demonstrates that software is not a deciding<br>factor for indexing ratio in GS. 
Each of the three recommended software<br>packages showed good indexing ratios for some repositories and poor<br>ratios for others."<br><br>The authors looked at one instance of EPrints and, despite it being a<br>relatively old version, found 88% of its records indexed in GS.<br><br>It is unfortunate that this paper has suggested that IR software in<br>general is poorly indexed in GS. On the contrary, some badly implemented<br>IR software is poorly indexed in GS.<br><br><br>All that said, the most critical factor for IR visibility is<br>having (BOAI definition) open access content. Hiding content behind<br>search forms, click-throughs and other things that emphasise the IR at<br>the expense of the content will hurt your visibility.<br><br>Lastly, Google will index your metadata-only records while Google<br>Scholar is looking for full-texts. Your GS/Google ratio will approximate<br>the proportion of your records that have an attached open access PDF<br>(.doc, etc.).<br><br><br>Sincerely,<br>Tim Brody<br>(EPrints Developer)<br><br>On Wed, 2012-02-15 at 11:31 +0000, Stevan Harnad wrote:<br><blockquote type="cite">Can we enhance the google-scholar discoverability of EPrints (and<br></blockquote><blockquote type="cite">DSpace) repositories?<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">http://linksource.ebsco.com/linking.aspx?sid=google&auinit=K&aulast=Arlitsch&atitle=Invisible+Institutional+Repositories:+Addressing+the+Low+Indexing+Ratios+of+IRs+in+Google+Scholar&title=Library+Hi+Tech&volume=30&issue=1&date=2012&spage=4&issn=0737-8831<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Kenning Arlitsch, Patrick Shawn O'Brien, (2012) "Invisible Institutional<br></blockquote><blockquote type="cite">Repositories: Addressing the Low Indexing Ratios of IRs in Google<br></blockquote><blockquote type="cite">Scholar", Library Hi Tech, Vol. 
30 Iss: 1<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Purpose - Google Scholar has difficulty indexing the contents of<br></blockquote><blockquote type="cite">institutional repositories, and the authors hypothesize the reason is<br></blockquote><blockquote type="cite">that most repositories use Dublin Core, which cannot express<br></blockquote><blockquote type="cite">bibliographic citation information adequately for academic papers.<br></blockquote><blockquote type="cite">Google Scholar makes specific recommendations for repositories,<br></blockquote><blockquote type="cite">including the use of publishing industry metadata schemas over Dublin<br></blockquote><blockquote type="cite">Core. This paper tests a theory that transforming metadata schemas in<br></blockquote><blockquote type="cite">institutional repositories will lead to increased indexing by Google<br></blockquote><blockquote type="cite">Scholar.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Design/methodology/approach - The authors conducted two surveys of<br></blockquote><blockquote type="cite">institutional and disciplinary repositories across the United States,<br></blockquote><blockquote type="cite">using different methodologies. They also conducted three pilot projects<br></blockquote><blockquote type="cite">that transformed the metadata of a subset of papers from USpace, the<br></blockquote><blockquote type="cite">University of Utah's institutional repository, and examined the results<br></blockquote><blockquote type="cite">of Google Scholar's explicit harvests.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Findings - Repositories that use GS recommended metadata schemas and<br></blockquote><blockquote type="cite">express them in HTML meta tags experienced significantly higher indexing<br></blockquote><blockquote type="cite">ratios. 
The ease with which search engine crawlers can navigate a<br></blockquote><blockquote type="cite">repository also seems to affect indexing ratio. The second and third<br></blockquote><blockquote type="cite">metadata transformation pilot projects at Utah were successful,<br></blockquote><blockquote type="cite">ultimately achieving an indexing ratio of greater than 90%. <br></blockquote><blockquote type="cite">Research limitations/implications - The second survey was limited to<br></blockquote><blockquote type="cite">forty titles from each of seven repositories, for a total of 280 titles.<br></blockquote><blockquote type="cite">A larger survey that covers more repositories may be useful.<br></blockquote><blockquote type="cite"><br></blockquote><blockquote type="cite">Practical implications - Institutional repositories are achieving<br></blockquote><blockquote type="cite">significant mass, and the rate of author citations from those<br></blockquote><blockquote type="cite">repositories may affect university rankings. Lack of visibility in<br></blockquote><blockquote type="cite">Google Scholar, however, will limit the ability of IRs to play a more<br></blockquote><blockquote type="cite">significant role in those citation rates.<br></blockquote><blockquote type="cite">Originality/value - Little or no research has been published about<br></blockquote><blockquote type="cite">improving the indexing ratio of institutional repositories in Google<br></blockquote><blockquote type="cite">Scholar. 
The authors believe that they are the first to address the<br></blockquote><blockquote type="cite">possibility of transforming IR metadata to improve indexing ratios in<br></blockquote><blockquote type="cite">Google Scholar.<br></blockquote><blockquote type="cite">*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech<br></blockquote><blockquote type="cite">*** Archive: http://www.eprints.org/tech.php/<br></blockquote><blockquote type="cite">*** EPrints community wiki: http://wiki.eprints.org/<br></blockquote></div></blockquote></div><br></body></html>