[EP-tech] Re: advanced search doesn't work with utf-8 characters

Dobrica Pavlinusic dpavlin at rot13.org
Tue Jul 9 13:17:03 BST 2013


On Mon, Jul 08, 2013 at 04:23:28PM +0000, Tommy Ingulfsen wrote:
> I think you may have come across the same problem that is described in
> this thread:
> 
> http://www.eprints.org/tech.php/thread-17424.html
> 
> Maybe you can try Tim's patch and see if that works for you?

Thank you very much for the pointer to the thread; Tim's patch does indeed
fix this problem.

Since it is already part of the next release, is there an ETA for 3.3.12?
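
For anyone finding this in the archive: both traces quoted below are what
one would expect if query words were matched with an ASCII-only pattern, so
that every non-ASCII letter acts as a word boundary. The following is a
minimal Python sketch of that effect, not EPrints code. The function names
and regular expressions are assumptions made up for illustration, and it
does not claim to show the exact mechanism that Tim's patch changes.

```python
import re

# Hypothetical tokenizers for illustration only; these are NOT taken
# from the EPrints source.
ASCII_WORD = re.compile(r"[A-Za-z0-9]+")   # non-ASCII letters end a word
UNICODE_WORD = re.compile(r"\w+")          # Unicode-aware for str in Python 3


def split_ascii(query):
    """Split a query into lowercase words, treating only ASCII as word chars."""
    return [word.lower() for word in ASCII_WORD.findall(query)]


def split_unicode(query):
    """Split a query into lowercase words, keeping non-ASCII letters."""
    return [word.lower() for word in UNICODE_WORD.findall(query)]


if __name__ == "__main__":
    for query in ("Agić", "Bolanča"):
        print(query, split_ascii(query), split_unicode(query))
```

With the ASCII-only pattern "Agić" is reduced to "agi" and "Bolanča" becomes
the two words "bolan" and "a", which matches the index() terms in the traces.
The Unicode-aware pattern keeps each name as a single word.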

> On 7/5/13 6:43 AM, "Dobrica Pavlinusic" <dpavlin at rot13.org> wrote:
> 
> >I have a problem with utf-8 characters in advanced search. None of the
> >queries that contain utf-8 characters (in Croatia we have a few of them:
> >šđčćž) produce any results.
> >
> >I have read through the wiki and this mailing list and figured out that
> >$EPrints::Index::FREETEXT_CHAR_MAPPING might be to blame. I added
> >mappings for our characters but it didn't help (it would be nice to have
> >full support for all characters without needing to edit the eprints
> >source).
> >
> >Digging through the eprints source code, I noticed that my queries
> >are split on utf-8 characters. If I uncomment the line in Eprints::Search
> >with $self->get_conditions->describe I can see the following behaviour:
> >
> >1. search query: "Agić" (utf-8 as last char)
> >
> >AND(
> >        =($archive.metadata_visibility,"show") ... eprint,
> >        =($archive.eprint_status,"archive") ... eprint,
> >        index($archive.creators_name,"agi") ... eprint__rindex
> >)
> >
> >As you can see, the utf-8 character gets dropped and the query doesn't
> >produce any results. I did check the eprint__rindex table and I do have
> >"agić" in there.
> >
> >2. search query: "Bolanča" (utf-8 as next-to-last char)
> >
> >AND(
> >        =($archive.metadata_visibility,"show") ... eprint,
> >        =($archive.eprint_status,"archive") ... eprint,
> >        AND(
> >                grep($archive.creators_name,"%[bolan]%[a]%-%") ... eprint__index_grep,
> >                AndSubQuery(
> >                        index($archive.creators_name,"bolan") ... eprint__rindex,
> >                        index($archive.creators_name,"a") ... eprint__rindex
> >                )
> >        )
> >)
> >
> >This is even worse, because the search query is split into two queries
> >at the utf-8 character.
> >
> >I spent the last three days inserting warns here and there in the source
> >code in an effort to find out where this splitting is happening, but I
> >have hit a brick wall with this problem.
> >
> >I would appreciate any info or pointers on how to resolve this problem.
> >
> >-- 
> >Dobrica Pavlinusic               2share!2flame
> >dpavlin at rot13.org
> >Unix addict. Internet consultant.
> >http://www.rot13.org/~dpavlin
> >
> >*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> >*** Archive: http://www.eprints.org/tech.php/
> >*** EPrints community wiki: http://wiki.eprints.org/
> 
> 

-- 
Dobrica Pavlinusic               2share!2flame            dpavlin at rot13.org
Unix addict. Internet consultant.             http://www.rot13.org/~dpavlin


