[EP-tech] Re: Antwort: Use of truncation in advanced searches
Gilles Fournié
gilles.fournie at cirad.fr
Tue Apr 21 13:37:05 BST 2015
Hi,
Thanks to both of you.
I have found the reason of the slowness with truncation.
It comes from the "COLLATE utf8_general_ci" used with LIKE.
If I run the query used by EPrints, it runs for 5 minutes :
mysql> SELECT eprint.eprintid FROM eprint LEFT JOIN
eprint__ordervalues_fr ON eprint.eprintid =
eprint__ordervalues_fr.eprintid ,
-> eprint__rindex AS eprint__rindex WHERE eprint.eprintid =
eprint__rindex.eprintid AND(
-> eprint.metadata_visibility = 'show' AND
eprint.eprint_status = 'archive' AND(
-> eprint__rindex.field = 'title' AND
eprint__rindex.word COLLATE utf8_general_ci LIKE 'thermograph%' OR
eprint__rindex.field = 'titre_parallele_titre' AND
eprint__rindex.word COLLATE utf8_general_ci LIKE 'thermograph%' OR
eprint__rindex.field = 'autre_titre_titre' AND eprint__rindex.word
COLLATE utf8_general_ci LIKE 'thermograph%'
-> )
-> ) GROUP BY eprint.eprintid ,
-> eprint__ordervalues_fr.date ,
-> eprint__ordervalues_fr.creators_name ,
-> eprint__ordervalues_fr.title ORDER BY
eprint__ordervalues_fr.date DESC ,
-> eprint__ordervalues_fr.creators_name ASC ,
-> eprint__ordervalues_fr.title ASC;
+----------+
| eprintid |
+----------+
| 513588 |
| 477759 |
| 476746 |
| 426892 |
+----------+
4 rows in set (5 min 33.61 sec)
But without the COLLATE part, it returns immediatly...
mysql> SELECT eprint.eprintid FROM eprint LEFT JOIN
eprint__ordervalues_fr ON eprint.eprintid =
eprint__ordervalues_fr.eprintid ,
-> eprint__rindex AS eprint__rindex WHERE eprint.eprintid =
eprint__rindex.eprintid AND(
-> eprint.metadata_visibility = 'show' AND
eprint.eprint_status = 'archive' AND(
-> eprint__rindex.field = 'title' AND
eprint__rindex.word LIKE 'thermograph%' OR
-> eprint__rindex.field = 'titre_parallele_titre' AND
eprint__rindex.word LIKE 'thermograph%' OR
-> eprint__rindex.field = 'autre_titre_titre' AND
eprint__rindex.word LIKE 'thermograph%'
-> )
-> ) GROUP BY eprint.eprintid ,
-> eprint__ordervalues_fr.date ,
-> eprint__ordervalues_fr.creators_name ,
-> eprint__ordervalues_fr.title ORDER BY
eprint__ordervalues_fr.date DESC ,
-> eprint__ordervalues_fr.creators_name ASC ,
-> eprint__ordervalues_fr.title ASC
-> ;
+----------+
| eprintid |
+----------+
| 513588 |
| 477759 |
| 476746 |
| 426892 |
+----------+
4 rows in set (0.00 sec)
Just for testing, I have tried modifying EPrints/Database/mysql.pm
(function sql_LIKE) :
* changing
o return " COLLATE utf8_general_ci LIKE ";
* to
o return ' LIKE ';
Now, queries from advanced search run fast even with truncations.
However, I fear that this modification has side effects !
What do you think about it ?
Is it safe to go on without COLLATE utf8_general_ci ?
(Note: the eprint__rindex table is defined with this same collation)
Thanks for your advices,
Regards,
Gilles
Le 21/04/2015 11:41, John Salter a écrit :
>
> Hi Giles,
>
> One thing we did on our platform was the change described here:
>
> https://github.com/eprints/eprints/issues/120 (see also:
> http://www.eprints.org/tech.php/18677.html).
>
> The description of this issue is around simple searches (when you’re
> not using Xapian), but I don’t know if it would have any effect on the
> advanced searches too.
>
> In some circumstances, MySQL takes longer to work out the best way to
> run a query than it would take just running the query inefficiently.
> This change makes it try less iterations to find the ‘best’ way to run
> the query.
>
> Might be worth a try?
>
> Cheers,
>
> John
>
> *From:*eprints-tech-bounces at ecs.soton.ac.uk
> [mailto:eprints-tech-bounces at ecs.soton.ac.uk] *On Behalf Of
> *martin.braendle at id.uzh.ch
> *Sent:* 21 April 2015 10:19
> *To:* eprints-tech at ecs.soton.ac.uk
> *Subject:* [EP-tech] Antwort: Use of truncation in advanced searches
>
> Hi Gilles,
>
> our repo has about 80'000 records and 56% fulltext, so is comparable
> to yours.
>
> Advanced search of thermograph* in
>
> title: immediate (1-2 seconds)
> documents (full text): 20-30 seconds. The mysql daemon goes up to
> 70-100% CPU load.
>
> Quick search (Xapian):
>
> title:thermograph* : immediate
> thermograph* : immediate
>
> We recommend in our help page (http://www.zora.uzh.ch/help/) that
> Quick Search should be the tool of choice and only for very precise
> searches Advanced Search should be used.
>
>
> From a recent debug session (on another issue) I know that EPrints
> translates behind the scenes an advanced search query into a series of
> dozens of complicated SQL statements. It might be that for certain
> cases these are not optimized.
>
> If it were that simple as
>
> select distinct ei.eprintid from eprint__rindex ei, eprint e where
> ei.field='documents' and ei.word like 'thermograph%' and
> e.eprint_status='archive' and e.eprintid=ei.eprintid;
>
> then that query would be answered in a fraction of a second. But it
> isn't, and can't be, and EPrints software engineers surely have put a
> lot of effort into the EPrints database engine part to cover all
> possible situations.
>
> Best regards,
>
> Martin
>
> --
> Dr. Martin Brändle
> Zentrale Informatik
> Universität Zürich
> Winterthurerstr. 190
> CH-8057 Zürich
>
>
> Inactive hide details for Gilles Fournié ---21/04/2015 10:22:27---Hi,
> I have a question about right-hand truncation in advancedGilles
> Fournié ---21/04/2015 10:22:27---Hi, I have a question about
> right-hand truncation in advanced searches.
>
> Von: Gilles Fournié <gilles.fournie at cirad.fr
> <mailto:gilles.fournie at cirad.fr>>
> An: eprints-tech at ecs.soton.ac.uk <mailto:eprints-tech at ecs.soton.ac.uk>
> Datum: 21/04/2015 10:22
> Betreff: [EP-tech] Use of truncation in advanced searches
> Gesendet von: eprints-tech-bounces at ecs.soton.ac.uk
> <mailto:eprints-tech-bounces at ecs.soton.ac.uk>
>
> ------------------------------------------------------------------------
>
>
>
>
> Hi,
>
> I have a question about right-hand truncation in advanced searches.
>
> If we search for (in title for example) :
>
> thermography
>
> the search runs for 1 to 3 seconds before returning results.
>
> If we extend our search to :
>
> thermography thermographie
>
> the search time is about the same.
>
> But if we try to use a wildcard :
>
> thermograph*
>
> the search takes a very long time (counts in minutes) !
>
> Does somebody have experienced such delays ?
> Any clues about what we can do to solve this problem ?
>
> (our archive contains ~ 91000 eprints)
>
> Best regards,
> GF
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/
>
>
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/
More information about the Eprints-tech
mailing list