[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[EP-tech] Re: Antwort: Use of truncation in advanced searches



Hi,

Thanks to both of you.

I have found the reason of the slowness with truncation.
It comes from the "COLLATE utf8_general_ci" used with LIKE.

If I run the query used by EPrints, it runs for 5 minutes :

    mysql> SELECT eprint.eprintid FROM eprint LEFT JOIN
    eprint__ordervalues_fr ON eprint.eprintid =
    eprint__ordervalues_fr.eprintid ,
         -> eprint__rindex AS eprint__rindex WHERE eprint.eprintid =
    eprint__rindex.eprintid AND(
         ->     eprint.metadata_visibility = 'show' AND
    eprint.eprint_status = 'archive' AND(
         ->         eprint__rindex.field = 'title' AND
    eprint__rindex.word COLLATE utf8_general_ci LIKE 'thermograph%' OR
    eprint__rindex.field = 'titre_parallele_titre' AND
    eprint__rindex.word COLLATE utf8_general_ci LIKE 'thermograph%' OR
    eprint__rindex.field = 'autre_titre_titre' AND eprint__rindex.word
    COLLATE utf8_general_ci LIKE 'thermograph%'
         ->     )
         -> ) GROUP BY eprint.eprintid ,
         -> eprint__ordervalues_fr.date ,
         -> eprint__ordervalues_fr.creators_name ,
         -> eprint__ordervalues_fr.title ORDER BY
    eprint__ordervalues_fr.date DESC ,
         -> eprint__ordervalues_fr.creators_name ASC ,
         -> eprint__ordervalues_fr.title ASC;
    +----------+
    | eprintid |
    +----------+
    |   513588 |
    |   477759 |
    |   476746 |
    |   426892 |
    +----------+
    4 rows in set (5 min 33.61 sec)

But without the COLLATE part, it returns immediatly...

    mysql> SELECT eprint.eprintid FROM eprint LEFT JOIN
    eprint__ordervalues_fr ON eprint.eprintid =
    eprint__ordervalues_fr.eprintid ,
         -> eprint__rindex AS eprint__rindex WHERE eprint.eprintid =
    eprint__rindex.eprintid AND(
         ->     eprint.metadata_visibility = 'show' AND
    eprint.eprint_status = 'archive' AND(
         ->         eprint__rindex.field = 'title' AND
    eprint__rindex.word LIKE 'thermograph%' OR
         ->         eprint__rindex.field = 'titre_parallele_titre' AND
    eprint__rindex.word LIKE 'thermograph%' OR
         ->         eprint__rindex.field = 'autre_titre_titre' AND
    eprint__rindex.word LIKE 'thermograph%'
         ->     )
         -> ) GROUP BY eprint.eprintid ,
         -> eprint__ordervalues_fr.date ,
         -> eprint__ordervalues_fr.creators_name ,
         -> eprint__ordervalues_fr.title ORDER BY
    eprint__ordervalues_fr.date DESC ,
         -> eprint__ordervalues_fr.creators_name ASC ,
         -> eprint__ordervalues_fr.title ASC
         -> ;
    +----------+
    | eprintid |
    +----------+
    |   513588 |
    |   477759 |
    |   476746 |
    |   426892 |
    +----------+
    4 rows in set (0.00 sec)


Just for testing, I have tried modifying EPrints/Database/mysql.pm 
(function sql_LIKE) :

  * changing
      o return " COLLATE utf8_general_ci LIKE ";
  * to
      o return ' LIKE ';

Now, queries from advanced search run fast even with truncations.

However, I fear that this modification has side effects !
What do you think about it ?
Is it safe to go on without COLLATE utf8_general_ci ?
(Note: the eprint__rindex table is defined with this same collation)

Thanks for your advices,
Regards,
Gilles



Le 21/04/2015 11:41, John Salter a ?crit :
>
> Hi Giles,
>
> One thing we did on our platform was the change described here:
>
> https://github.com/eprints/eprints/issues/120 (see also: 
> http://www.eprints.org/tech.php/18677.html).
>
> The description of this issue is around simple searches (when you?re 
> not using Xapian), but I don?t know if it would have any effect on the 
> advanced searches too.
>
> In some circumstances, MySQL takes longer to work out the best way to 
> run a query than it would take just running the query inefficiently. 
> This change makes it try less iterations to find the ?best? way to run 
> the query.
>
> Might be worth a try?
>
> Cheers,
>
> John
>
> *From:*eprints-tech-bounces at ecs.soton.ac.uk 
> [mailto:eprints-tech-bounces at ecs.soton.ac.uk] *On Behalf Of 
> *martin.braendle at id.uzh.ch
> *Sent:* 21 April 2015 10:19
> *To:* eprints-tech at ecs.soton.ac.uk
> *Subject:* [EP-tech] Antwort: Use of truncation in advanced searches
>
> Hi Gilles,
>
> our repo has about 80'000 records and 56% fulltext, so is comparable 
> to yours.
>
> Advanced search of thermograph* in
>
> title: immediate (1-2 seconds)
> documents (full text): 20-30 seconds. The mysql daemon goes up to 
> 70-100% CPU load.
>
> Quick search (Xapian):
>
> title:thermograph*  : immediate
> thermograph* : immediate
>
> We recommend in our help page (http://www.zora.uzh.ch/help/) that 
> Quick Search should be the tool of choice and only for very precise 
> searches Advanced Search should be used.
>
>
> From a recent debug session (on another issue) I know that EPrints 
> translates behind the scenes an advanced search query into a series of 
> dozens of complicated SQL statements. It might be that for certain 
> cases these are not optimized.
>
> If it were that simple as
>
> select distinct ei.eprintid from eprint__rindex ei, eprint e  where 
> ei.field='documents' and ei.word like 'thermograph%' and 
> e.eprint_status='archive' and e.eprintid=ei.eprintid;
>
> then that query would be answered in a fraction of a second. But it 
> isn't, and can't be, and EPrints software engineers surely have put a 
> lot of effort into the EPrints database engine part to cover all 
> possible situations.
>
> Best regards,
>
> Martin
>
> --
> Dr. Martin Br?ndle
> Zentrale Informatik
> Universit?t Z?rich
> Winterthurerstr. 190
> CH-8057 Z?rich
>
>
> Inactive hide details for Gilles Fourni? ---21/04/2015 10:22:27---Hi, 
> I have a question about right-hand truncation in advancedGilles 
> Fourni? ---21/04/2015 10:22:27---Hi, I have a question about 
> right-hand truncation in advanced searches.
>
> Von: Gilles Fourni? <gilles.fournie at cirad.fr 
> <mailto:gilles.fournie at cirad.fr>>
> An: eprints-tech at ecs.soton.ac.uk <mailto:eprints-tech at ecs.soton.ac.uk>
> Datum: 21/04/2015 10:22
> Betreff: [EP-tech]  Use of truncation in advanced searches
> Gesendet von: eprints-tech-bounces at ecs.soton.ac.uk 
> <mailto:eprints-tech-bounces at ecs.soton.ac.uk>
>
> ------------------------------------------------------------------------
>
>
>
>
> Hi,
>
> I have a question about right-hand truncation in advanced searches.
>
> If we search for (in title for example) :
>
>     thermography
>
> the search runs for 1 to 3 seconds before returning results.
>
> If we extend our search to :
>
>     thermography thermographie
>
> the search time is about the same.
>
> But if we try to use a wildcard :
>
>     thermograph*
>
> the search takes a very long time (counts in minutes) !
>
> Does somebody have experienced such delays ?
> Any clues about what we can do to solve this problem ?
>
> (our archive contains ~ 91000 eprints)
>
> Best regards,
> GF
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/
>
>
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/