[EP-tech] advanced search doesn't work with utf-8 characters

Dobrica Pavlinusic dpavlin at rot13.org
Fri Jul 5 14:43:27 BST 2013


I have a problem with UTF-8 characters in advanced search. None of the
queries which contain UTF-8 characters (in Croatia we have a few of them:
šđčćž) produce any results.

I have read through the wiki and this mailing list and figured out that
$EPrints::Index::FREETEXT_CHAR_MAPPING might be to blame. I added
mappings for our characters, but it didn't help (it would be nice to have
full support for all characters without the need to edit the EPrints source).
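
For illustration only, a character mapping of this kind amounts to folding
accented letters to plain ASCII before terms are compared. The sketch below
is in Python rather than Perl and is not EPrints code; the fold function and
the EXTRA table are hypothetical names. One detail worth noting: đ has no
Unicode decomposition, so unlike š, č, ć and ž it needs an explicit entry
(mapping it to "d" is an assumption; some transliterations use "dj").

```python
import unicodedata

# Hypothetical explicit entries for letters that have no Unicode decomposition.
EXTRA = {"đ": "d", "Đ": "D"}

def fold(text):
    """Fold accented Latin letters to their ASCII base letters."""
    out = []
    for ch in text:
        ch = EXTRA.get(ch, ch)
        # NFD splits e.g. "š" into "s" + combining caron; drop the combining marks.
        for part in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(out)

print(fold("šđčćž"))         # sdccz
print(fold("Agić").lower())  # agic
```

A mapping like this only helps if it is applied the same way when indexing
and when searching.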

Digging around in the EPrints source code, I noticed that my queries
are split on UTF-8 characters. If I uncomment the line in EPrints::Search
with $self->get_conditions->describe I can see the following behaviour:

1. search query: "Agić" (UTF-8 character is the last char)

AND(
        =($archive.metadata_visibility,"show") ... eprint,
        =($archive.eprint_status,"archive") ... eprint,
        index($archive.creators_name,"agi") ... eprint__rindex
)

As you can see, the UTF-8 character gets dropped, and the query doesn't
produce any results. I did check the eprint__rindex table, and I do have
"agić" in there.

2. search query: "Bolanča" (UTF-8 character is the next-to-last char)

AND(
        =($archive.metadata_visibility,"show") ... eprint,
        =($archive.eprint_status,"archive") ... eprint,
        AND(
                grep($archive.creators_name,"%[bolan]%[a]%-%") ... eprint__index_grep,
                AndSubQuery(
                        index($archive.creators_name,"bolan") ... eprint__rindex,
                        index($archive.creators_name,"a") ... eprint__rindex
                )
        )
)

This is even worse, because the search query is split into two terms
("bolan" and "a") at the UTF-8 character.
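
One way output like this can arise, offered as a guess and not as a
diagnosis of the EPrints code, is a tokenizer that treats only ASCII letters
as word characters. The Python sketch below (ascii_terms and unicode_terms
are hypothetical names) reproduces both cases above: the accented letter is
dropped at the end of "Agić" and acts as a separator inside "Bolanča".

```python
import re

def ascii_terms(query):
    """Split a query into terms, treating only ASCII letters as word characters."""
    return re.findall(r"[a-z]+", query.lower())

def unicode_terms(query):
    """Split a query into terms using Unicode-aware word characters."""
    return re.findall(r"\w+", query.lower())

print(ascii_terms("Agić"))       # ['agi']
print(ascii_terms("Bolanča"))    # ['bolan', 'a']
print(unicode_terms("Agić"))     # ['agić']
print(unicode_terms("Bolanča"))  # ['bolanča']
```

If something similar is happening here, the place to look would be wherever
the query string is broken into words before the index lookup.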

I spent the last three days inserting warns here and there in the source
code in an effort to find out where this splitting is happening, but I have
hit a brick wall with this problem.

I would appreciate any info or pointers on how to resolve this problem.

-- 
Dobrica Pavlinusic               2share!2flame            dpavlin at rot13.org
Unix addict. Internet consultant.             http://www.rot13.org/~dpavlin


