EPrints Technical Mailing List Archive

Message: #03127



[EP-tech] Re: large-scale repositories?


 

The University of Southampton repository has over 100k records, and it works fine.

Bits that may not scale so well on 3.2/3.3:

- Searching/Indexing: the search indexes are stored alongside your data in the MySQL database, and deep LEFT JOINs are generated if you use many fields in your simple search

- Too many Compound/Multiple fields: each compound/multiple field adds an auxiliary DB table (one extra READ or WRITE for each of those)

- Views: crunching the "totals" is tricky over large filtered datasets, and there is also lots of sorting going on -> slow

- Document relations: some bugs in EPrints 3.2 generate lots of document relations (thumbnails etc.), which clogs the DB

- History: similarly, some bugs in early 3.2 releases were generating far too many "history" records (one DB record + one XML file on disk), which slows things down a lot
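
To make the first two points concrete, here is a minimal Python/sqlite3 sketch (not EPrints code). The table and column names are made up for illustration and are not the real EPrints schema; the only point is that each multiple field costs one extra query when loading a record, and one extra LEFT JOIN when it is included in a simple search.

```python
import sqlite3

# Hypothetical schema for illustration only; NOT the real EPrints tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE eprint (eprintid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE eprint_creators (eprintid INTEGER, pos INTEGER, name TEXT);
    CREATE TABLE eprint_subjects (eprintid INTEGER, pos INTEGER, subject TEXT);
""")
conn.execute("INSERT INTO eprint VALUES (1, 'Scaling repositories')")
conn.executemany("INSERT INTO eprint_creators VALUES (?, ?, ?)",
                 [(1, 0, "Smith"), (1, 1, "Jones")])
conn.execute("INSERT INTO eprint_subjects VALUES (1, 0, 'QA76')")

# Each multiple field lives in its own auxiliary table (assumed layout).
MULTIPLE = {
    "creators": ("eprint_creators", "name"),
    "subjects": ("eprint_subjects", "subject"),
}

def load_record(conn, eprintid, multiple=MULTIPLE):
    """Load one record and return (record, number_of_queries_issued)."""
    row = conn.execute(
        "SELECT eprintid, title FROM eprint WHERE eprintid = ?", (eprintid,)
    ).fetchone()
    record = {"eprintid": row[0], "title": row[1]}
    queries = 1
    for field, (table, column) in multiple.items():
        # One extra READ per multiple field.
        rows = conn.execute(
            f"SELECT {column} FROM {table} WHERE eprintid = ? ORDER BY pos",
            (eprintid,),
        ).fetchall()
        record[field] = [r[0] for r in rows]
        queries += 1
    return record, queries

def simple_search(conn, term):
    """Search title, creators and subjects: one LEFT JOIN per multiple field."""
    pattern = f"%{term}%"
    rows = conn.execute(
        """
        SELECT DISTINCT e.eprintid
        FROM eprint e
        LEFT JOIN eprint_creators c ON c.eprintid = e.eprintid
        LEFT JOIN eprint_subjects s ON s.eprintid = e.eprintid
        WHERE e.title LIKE ? OR c.name LIKE ? OR s.subject LIKE ?
        ORDER BY e.eprintid
        """,
        (pattern, pattern, pattern),
    ).fetchall()
    return [r[0] for r in rows]

record, queries = load_record(conn, 1)
print(record, queries)
print(simple_search(conn, "Jones"))
```

With two multiple fields, a single record load issues three queries, and the search joins three tables; both numbers grow with every multiple field added.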

 

Unlike Yuri, I don't recall any slow delivery of content. If you look at Apache::Rewrite you'll see that EPrints releases the file to Apache early in the request process, and that scales.

FYI, I want to take searching out of EPrints altogether and use only Xapian: no more "search/indexes" data in your metadata database -> lighter DB, with searching/ordering done by a third-party library we don't need to maintain. Xapian also offers lots of extras (facets, suggestions, probabilistic matching...)

Also, on my eprints4 branch on GitHub you'll see a series of patches that enable memory caching (via memcached) to read data records (eprint, user...) from memory rather than from the DB (of course it falls back to the DB when the record is modified). Untested on 3.3, may work ;-)
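
The caching idea can be sketched in a few lines of Python. This is not the code from the eprints4 branch: the class and method names are hypothetical, a plain dict stands in for memcached, and invalidating the cache entry on every write is an assumption about how "falling back to the DB when the record is modified" might be implemented.

```python
class RecordStore:
    """Stand-in for the database; counts reads so the effect is visible."""

    def __init__(self):
        self.rows = {}
        self.reads = 0

    def read(self, key):
        self.reads += 1
        row = self.rows.get(key)
        return dict(row) if row is not None else None

    def write(self, key, record):
        self.rows[key] = dict(record)


class CachedRecords:
    """Read-through cache: serve from memory, fall back to the DB on a miss."""

    def __init__(self, store, cache=None):
        self.store = store
        # A dict stands in for a memcached client here (assumption).
        self.cache = {} if cache is None else cache

    @staticmethod
    def _key(dataset, record_id):
        return f"{dataset}:{record_id}"

    def get(self, dataset, record_id):
        key = self._key(dataset, record_id)
        if key in self.cache:
            return self.cache[key]
        record = self.store.read(key)      # cache miss: go to the DB
        if record is not None:
            self.cache[key] = record
        return record

    def commit(self, dataset, record_id, record):
        key = self._key(dataset, record_id)
        self.store.write(key, record)
        self.cache.pop(key, None)          # modified: next read hits the DB


db = RecordStore()
records = CachedRecords(db)
records.commit("eprint", 42, {"title": "Draft"})
records.get("eprint", 42)                  # first read goes to the DB
records.get("eprint", 42)                  # second read is served from memory
print(db.reads)
records.commit("eprint", 42, {"title": "Final"})
print(records.get("eprint", 42), db.reads)
```

Dropping the entry on write rather than updating it in place keeps the sketch simple: the next read repopulates the cache from whatever the DB actually stored.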

 

Seb

 

 

On 09.06.2014 11:53, Yuri wrote:

On 09/06/2014 10:09, Ian Stuart wrote:
Are there any large-scale EPrints repos out there? (by large scale, I mean 100,000+ accessible records)
We have about 40,000 records across two repositories (10,000 of them with
full text).

I think the big problem is Apache's delivery of files (you also have to tune
it for both Perl and static content...). There should be a way to serve
files without using Perl, or only minimally. Another big problem is
updating views: it takes a lot of time, and I had to disable some of them
because it takes ages (days) to regenerate/update a view.

The site is often at load 1 to 1.5, most of the time from serving PDFs to outside users.
It works, but not perfectly.
The database technology will cope with up to 2 million records, but I don't think the rest of EPrints will cope :D ... but what's in use, in practice?
*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/