[EP-tech] Re: large-scale repositories?

Andrew D Bell a.d.bell at ecs.soton.ac.uk
Mon Jun 9 10:18:54 BST 2014


The largest that I'm aware of (>300K) is http://discovery.ucl.ac.uk/.

Andrew


On 09/06/14 10:04, sf2 wrote:
>
> The Uni of Southampton has over 100k records, the repo works fine.
>
> Bits that may not scale so well on 3.2/3.3:
>
> - Searching/Indexing: indexes are stored alongside your data in the MySQL 
> database, and deep LEFT JOINs are generated if you're using many fields in 
> your simple search
>
> - Too many Compound/Multiple fields: each compound/multiple field adds 
> an auxiliary DB table (one extra READ or WRITE for each of those)
>
> - Views: crunching the "totals" is tricky over large filtered datasets 
> - also lots of sorting going on -> slow
>
> - Document relations: some bugs in EPrints 3.2 generate lots of 
> document relations (thumbnails etc) - clogs the DB
>
> - History: similarly some bugs in early 3.2's were generating far too 
> many "history" records (one DB record + one XML file on-disk) which 
> slows things down a lot
>
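
To illustrate the first two points above, here is a minimal sketch of how the join count grows with the number of searched fields. The table layout (one eprint_<field> table per field, with eprintid/pos/value columns), the field names and build_search_sql are assumptions made up for this example; this is not the SQL that EPrints actually generates.

```python
import sqlite3

def build_search_sql(fields):
    """Build a simple-search query with one LEFT JOIN per searched field.

    Assumes a non-empty list of fields and a hypothetical schema in which
    each field lives in its own table named eprint_<field>.
    """
    joins = [
        f"LEFT JOIN eprint_{f} AS t{i} ON t{i}.eprintid = e.eprintid"
        for i, f in enumerate(fields)
    ]
    where = " OR ".join(f"t{i}.value LIKE :q" for i in range(len(fields)))
    return ("SELECT DISTINCT e.eprintid FROM eprint AS e "
            + " ".join(joins) + " WHERE " + where)

# Hypothetical field names, chosen only for illustration.
fields = ["title", "creators", "keywords"]

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY)")
for f in fields:
    # One extra table per field: one extra read or write each time.
    db.execute(
        f"CREATE TABLE eprint_{f} (eprintid INTEGER, pos INTEGER, value TEXT)"
    )
db.execute("INSERT INTO eprint VALUES (1)")
db.execute("INSERT INTO eprint_creators VALUES (1, 0, 'Smith')")

sql = build_search_sql(fields)
rows = db.execute(sql, {"q": "%Smith%"}).fetchall()
print(sql.count("LEFT JOIN"), "joins for", len(fields), "fields")
print(rows)
```

Every field added to the simple search adds another join to the query, which is why trimming the list of searched fields is usually the cheapest thing to try first.
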
> Unlike Yuri, I don't recall any slow delivery of content - if you look 
> at Apache::Rewrite you'll see that EPrints releases the file to Apache 
> early in the request process - and that scales.
>
> FYI, I want to get rid of searching out of EPrints altogether and use 
> only Xapian: no more "search/indexes" data in your metadata database 
> -> lighter DB, searching/ordering done by a third-party library we don't 
> need to maintain. Also Xapian offers lots of extras (facets, 
> suggestions, probability match...)
>
> Also, on my eprints4 branch on github you'll see a series of patches 
> to enable memory caching (via memcached) to read data records 
> (eprint, user...) from memory rather than from the DB (of course it falls 
> back to the DB when the record is modified). Untested on 3.3, may 
> work ;-)
>
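
For anyone unfamiliar with the pattern, the caching idea described above can be sketched roughly as follows. This is not the code from the eprints4 branch: RecordStore and its methods are hypothetical names, and plain dicts stand in for both memcached and the SQL database so that the example runs on its own.

```python
class RecordStore:
    """Read-through cache: serve records from memory, fall back to the DB."""

    def __init__(self, db):
        self.db = db          # dict standing in for the SQL database
        self.cache = {}       # dict standing in for a memcached client
        self.db_reads = 0     # counts how often we had to hit the DB

    def get(self, dataset, record_id):
        key = f"{dataset}:{record_id}"
        if key in self.cache:
            return self.cache[key]       # cache hit: no DB access
        self.db_reads += 1
        record = self.db[key]            # cache miss: read from the DB
        self.cache[key] = record
        return record

    def update(self, dataset, record_id, record):
        key = f"{dataset}:{record_id}"
        self.db[key] = record
        # Drop the stale cached copy, so the next read goes to the DB.
        self.cache.pop(key, None)


store = RecordStore({"eprint:1": {"title": "Old title"}})
store.get("eprint", 1)
store.get("eprint", 1)                   # second read is served from memory
reads_before_update = store.db_reads
store.update("eprint", 1, {"title": "New title"})
title_after_update = store.get("eprint", 1)["title"]
print(reads_before_update, store.db_reads, title_after_update)
```

Invalidating on write (rather than updating the cached copy in place) keeps the DB as the single source of truth, at the cost of one extra DB read after each modification.
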
> Seb
>
> On 09.06.2014 11:53, Yuri wrote:
>
>> On 09/06/2014 10:09, Ian Stuart wrote:
>>> Are there any large-scale EPrints repos out there? (by large scale, 
>>> I mean 100,000+ accessible records)
>> We have about 40,000 records in two repositories (10,000 of them with
>> full text).
>>
>> I think the big problem is Apache delivering files (you also have to tune
>> it for both Perl and static content...); there should be a way to serve
>> files without using Perl, or only minimally. Another big problem is
>> updating views: it takes a lot of time, and I had to disable some of them
>> because it takes ages (days) to regenerate/update a view.
>>
>> The site is often at load 1 to 1.5, most of the time serving PDFs to
>> outside users. It works, but not perfectly.
>>> The database technology will cope with up to 2 million records, but 
>>> I don't think the rest of EPrints will cope :D ... but what's in 
>>> use, in practice?
>> *** Options:http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>> *** Archive:http://www.eprints.org/tech.php/
>> *** EPrints community wiki:http://wiki.eprints.org/
>> *** EPrints developers Forum:http://forum.eprints.org/
>


-- 
Andrew D Bell
EPrints Services
School of Electronics and Computer Science
University of Southampton
Southampton
SO17 1BJ

+44 (0)23 8059 8814
a.d.bell at ecs.soton.ac.uk

http://www.eprints.org/
http://eprintsservices.wordpress.com/
http://twitter.com/EPrintsServices
