[EP-tech] Re: large-scale repositories?

sf2 sf2 at ecs.soton.ac.uk
Mon Jun 9 10:04:07 BST 2014


 

The University of Southampton has over 100k records, and the repo works fine. 

Bits that may not scale so well on 3.2/3.3: 

- Searching/Indexing: indexes are stored alongside your data in the MySQL
database, and deep LEFT JOINs are generated if you're using many fields
in your simple search 

- Too many Compound/Multiple fields: each compound/multiple field adds an
auxiliary DB table (one extra READ or WRITE for each of those) 

- Views: crunching the "totals" is tricky over large filtered datasets,
and there is also a lot of sorting going on, so it is slow 

- Document relations: some bugs in EPrints 3.2 generate lots of
document relations (thumbnails etc.), which clogs the DB 

- History: similarly, some bugs in early 3.2 releases were generating far
too many "history" records (one DB record + one XML file on disk each),
which slows things down a lot 
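
To make the compound/multiple field point concrete, here is a minimal
sketch in Python using an in-memory SQLite database. The table and column
names (eprint, eprint_creators, eprint_subjects) and the helper functions
are made up for illustration, not taken from the actual EPrints schema or
code; the point is only that each auxiliary table costs one more read when
a record is loaded.

```python
import sqlite3


def build_demo_db():
    """Create a toy schema: one main table plus one auxiliary table per multiple field."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE eprint_creators (eprintid INTEGER, pos INTEGER, name TEXT);
        CREATE TABLE eprint_subjects (eprintid INTEGER, pos INTEGER, subject TEXT);
        INSERT INTO eprint VALUES (1, 'Scaling repositories');
        INSERT INTO eprint_creators VALUES (1, 0, 'Smith');
        INSERT INTO eprint_creators VALUES (1, 1, 'Jones');
        INSERT INTO eprint_subjects VALUES (1, 0, 'QA76');
    """)
    return conn


def load_record(conn, eprintid, aux_tables):
    """Load one record and count the queries: 1 for the main table + 1 per auxiliary table."""
    reads = 0
    row = conn.execute(
        "SELECT title FROM eprint WHERE eprintid = ?", (eprintid,)
    ).fetchone()
    reads += 1
    record = {"eprintid": eprintid, "title": row[0]}
    # aux_tables is a list of (table, column) pairs; names are trusted here
    # because this is a demo, not something to do with user input.
    for table, column in aux_tables:
        rows = conn.execute(
            f"SELECT {column} FROM {table} WHERE eprintid = ? ORDER BY pos",
            (eprintid,),
        ).fetchall()
        reads += 1
        record[column] = [r[0] for r in rows]
    return record, reads


conn = build_demo_db()
record, reads = load_record(
    conn, 1, [("eprint_creators", "name"), ("eprint_subjects", "subject")]
)
print(record, "loaded with", reads, "queries")
```

With two multiple fields the record takes three queries to load; every
further compound/multiple field adds another one, on write as well as on
read.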

Unlike Yuri, I don't recall any slow delivery of content. If you look
at EPrints::Apache::Rewrite you'll see that EPrints releases the file to
Apache early in the request process, and that scales. 

FYI, I want to take searching out of EPrints altogether and use only
Xapian: no more "search/indexes" data in your metadata database, so a
lighter DB, with searching/ordering done by a third-party library we
don't need to maintain. Xapian also offers lots of extras (facets,
suggestions, probabilistic matching...) 

Also, on my eprints4 branch on GitHub you'll see a series of patches to
enable memory caching (via memcached), so that data records (eprint,
user, ...) are read from memory rather than from the DB (falling back to
the DB, of course, when the record is modified). Untested on 3.3, may
work ;-) 
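
The caching idea can be sketched roughly as follows, in Python. This is
not the code from the eprints4 patches: the class and method names are
made up, and plain dicts stand in for both the database and the memcached
client. It only shows the pattern of reading from memory first and going
back to the DB after a record has been modified.

```python
class CachedRecordStore:
    """Read-through cache: serve records from memory, fall back to the DB on a miss."""

    def __init__(self, db):
        self.db = db        # stand-in for the metadata database: {(dataset, id): record}
        self.cache = {}     # stand-in for a memcached client
        self.db_reads = 0   # counts how often the DB was actually hit

    @staticmethod
    def _key(dataset, record_id):
        return f"{dataset}:{record_id}"

    def get(self, dataset, record_id):
        key = self._key(dataset, record_id)
        if key in self.cache:
            return self.cache[key]              # cache hit: no DB access
        self.db_reads += 1                      # cache miss: read from the DB
        record = dict(self.db[(dataset, record_id)])
        self.cache[key] = record
        return record

    def update(self, dataset, record_id, record):
        # Write to the DB and drop the cached copy, so the next read
        # falls back to the DB and picks up the new version.
        self.db[(dataset, record_id)] = dict(record)
        self.cache.pop(self._key(dataset, record_id), None)


store = CachedRecordStore({("eprint", 1): {"title": "Old title"}})
store.get("eprint", 1)
store.get("eprint", 1)                          # served from memory
store.update("eprint", 1, {"title": "New title"})
print(store.get("eprint", 1), "DB reads:", store.db_reads)
```

Repeated reads of an unmodified record hit the DB once; a modification
invalidates the cached copy, so only the next read pays for a DB query.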

Seb 

On 09.06.2014 11:53, Yuri wrote: 

> On 09/06/2014 10:09, Ian Stuart wrote:
> 
>> Are there any large-scale EPrints repos out there? (by large scale, I mean 100,000+ accessible records)
> 
> We have about 40,000 records in two repositories (10,000 of them with 
> full text).
> 
> I think the big problem is in Apache delivering files (you also have to 
> tune it for both Perl and static content...); there should be a way to 
> serve files without using Perl, or only minimally. Another big problem is 
> updating views: it takes a lot of time, and I had to disable some of them 
> because it takes ages (days) to regenerate/update the view.
> 
> The site is often at load 1 to 1.5, most of the time serving PDFs to 
> outside users. It works, but not perfectly.
> 
>> The database technology will cope with up to 2 million records, but I don't think the rest of EPrints will cope :D ... but what's in use, in practice?
> 
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/