[EP-tech] Re: large-scale repositories?
Andrew D Bell
a.d.bell at ecs.soton.ac.uk
Mon Jun 9 10:18:54 BST 2014
The largest that I'm aware of (>300K records) is http://discovery.ucl.ac.uk.
Andrew
On 09/06/14 10:04, sf2 wrote:
>
> The University of Southampton repository has over 100k records and works fine.
>
> Bits that may not scale so well on 3.2/3.3:
>
> - Searching/Indexing: search indexes are stored alongside your data in
> the MySQL database - deep LEFT JOINs are generated if you're using many
> fields in your simple search
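A minimal sketch of the join growth described above, in Python. The table and field names (`eprint`, `eprint_creators_name`, ...) are illustrative stand-ins, not EPrints' actual schema: the point is simply that each extra field in a simple search chains on another LEFT JOIN.

```python
# Hypothetical query builder: one LEFT JOIN per multiple-value field searched.
# Schema names are assumptions for illustration only.
def build_simple_search_sql(fields):
    joins = []
    wheres = []
    for i, field in enumerate(fields):
        alias = f"aux{i}"
        joins.append(
            f"LEFT JOIN eprint_{field} AS {alias} "
            f"ON {alias}.eprintid = eprint.eprintid"
        )
        wheres.append(f"{alias}.{field} LIKE ?")
    return (
        "SELECT DISTINCT eprint.eprintid FROM eprint\n"
        + "\n".join(joins)
        + "\nWHERE " + " OR ".join(wheres)
    )

sql = build_simple_search_sql(
    ["creators_name", "subjects", "keywords", "corp_creators"]
)
print(sql.count("LEFT JOIN"))  # one join per searched field
```

Four searched fields already mean a four-deep join chain; a simple search configured over a dozen fields gets correspondingly worse.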
>
> - Too many Compound/Multiple fields: each compound/multiple field adds
> a DB auxiliary table (one extra READ or WRITE for each of those)
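The read amplification this bullet describes can be sketched with sqlite3 standing in for MySQL; the schema is made up for illustration. Each multiple-value field lives in its own auxiliary table, so materialising one record costs one extra SELECT per such field.

```python
# Illustrative schema: one aux table per multiple-value field (assumption,
# not the real EPrints layout).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE eprint_creators (eprintid INT, pos INT, name TEXT)")
conn.execute("CREATE TABLE eprint_subjects (eprintid INT, pos INT, subject TEXT)")
conn.execute("INSERT INTO eprint VALUES (1, 'Scaling EPrints')")
conn.executemany("INSERT INTO eprint_creators VALUES (?, ?, ?)",
                 [(1, 0, "Smith"), (1, 1, "Jones")])
conn.execute("INSERT INTO eprint_subjects VALUES (1, 0, 'QA75')")

def load_eprint(conn, eprintid):
    row = conn.execute("SELECT title FROM eprint WHERE eprintid=?",
                       (eprintid,)).fetchone()
    record = {"title": row[0], "queries": 1}
    for table, col in [("eprint_creators", "name"),
                       ("eprint_subjects", "subject")]:
        rows = conn.execute(
            f"SELECT {col} FROM {table} WHERE eprintid=? ORDER BY pos",
            (eprintid,)).fetchall()
        record[table] = [r[0] for r in rows]
        record["queries"] += 1  # one extra READ per multiple field
    return record

rec = load_eprint(conn, 1)
print(rec["queries"])  # 3 queries for a record with 2 multiple fields
```

With dozens of compound/multiple fields, every record load fans out into dozens of table accesses, which is why these fields stop scaling first.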
>
> - Views: crunching the "totals" is tricky over large filtered datasets
> - also lots of sorting going on -> slow
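As a rough illustration of the cost shape in that bullet (synthetic data, not EPrints code): computing browse-view totals by sorting then grouping is dominated by the O(n log n) sort, while a single counting pass gets the same totals in O(n).

```python
# Synthetic dataset: publication years for 100k records.
import collections
import itertools
import random

random.seed(0)
years = [random.choice(range(1990, 2015)) for _ in range(100_000)]

# Sort-then-group, as an ORDER BY + grouping pass would do: O(n log n).
totals_sorted = {
    year: sum(1 for _ in grp)
    for year, grp in itertools.groupby(sorted(years))
}

# Single pass, no sort needed just to get totals: O(n).
totals_counter = collections.Counter(years)

print(totals_sorted == dict(totals_counter))  # True
```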
>
> - Document relations: some bugs in EPrints 3.2 generate lots of
> document relations (thumbnails etc.), which clogs the DB
>
> - History: similarly, some bugs in early 3.2 releases were generating far
> too many "history" records (one DB record + one XML file on disk), which
> slows things down a lot
>
> Unlike Yuri, I don't recall any slow delivery of content - if you look
> at Apache::Rewrite you'll see that EPrints releases the file to Apache
> early in the request process - and that scales.
>
> FYI, I want to move searching out of EPrints altogether and use
> only Xapian: no more "search/indexes" data in your metadata database
> -> a lighter DB, with searching/ordering done by a third-party library we
> don't need to maintain. Xapian also offers lots of extras (facets,
> suggestions, probabilistic matching...)
>
> Also, on my eprints4 branch on GitHub you'll see a series of patches
> to enable memory caching (via memcached) so that data records
> (eprint, user...) are read from memory rather than from the DB (falling
> back to the DB when a record is modified). Untested on 3.3, may
> work ;-)
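The caching scheme described above can be sketched as a read-through cache with write invalidation. This is a minimal Python stand-in (a dict plays the role of the memcached client), not code from the eprints4 branch.

```python
# Read-through cache sketch: reads hit the cache first; writes invalidate
# the cached copy so the next read falls back to the DB.
class CachedDataStore:
    def __init__(self, db):
        self.db = db        # e.g. {"eprint/1": {...}, "user/7": {...}}
        self.cache = {}     # stand-in for a memcached client
        self.db_reads = 0

    def read(self, key):
        if key in self.cache:
            return self.cache[key]      # served from memory
        self.db_reads += 1              # fall back to the DB
        value = self.db[key]
        self.cache[key] = value
        return value

    def write(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)       # invalidate: next read hits the DB

store = CachedDataStore({"eprint/1": {"title": "old"}})
store.read("eprint/1")                      # DB read, now cached
store.read("eprint/1")                      # cache hit
store.write("eprint/1", {"title": "new"})   # invalidates the cached copy
print(store.read("eprint/1"), store.db_reads)  # fresh value, 2 DB reads
```

Invalidation-on-write keeps the cache from serving stale records, at the cost of one extra DB read after each modification.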
>
> Seb
>
> On 09.06.2014 11:53, Yuri wrote:
>
>> On 09/06/2014 10:09, Ian Stuart wrote:
>>> Are there any large-scale EPrints repos out there? (by large scale,
>>> I mean 100,000+ accessible records)
>> we have about 40,000 records across two repositories (10,000 of them
>> with full text)
>>
>> I think the big problem is Apache delivering files (you also have to tune
>> it for Perl as well as static content...); there should be a way to serve
>> files without using Perl, or in a minimal way. Another big problem is
>> updating views: it takes a lot of time, and I had to disable some of them
>> because it takes ages (days) to regenerate/update the view.
>>
>> The site is often at load 1 to 1.5, most of the time serving PDFs to the
>> outside. It works, but not perfectly.
>>> The database technology will cope with up to 2 million records, but
>>> I don't think the rest of EPrints will cope :D ... but what's in
>>> use, in practice?
>> *** Options:http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>> *** Archive:http://www.eprints.org/tech.php/
>> *** EPrints community wiki:http://wiki.eprints.org/
>> *** EPrints developers Forum:http://forum.eprints.org/
>
>
>
--
Andrew D Bell
EPrints Services
School of Electronics and Computer Science
University of Southampton
Southampton
SO17 1BJ
+44 (0)23 8059 8814
a.d.bell at ecs.soton.ac.uk
http://www.eprints.org/
http://eprintsservices.wordpress.com/
http://twitter.com/EPrintsServices