[EP-tech] Re: international character search problem
Tommy Ingulfsen
tommy at library.caltech.edu
Thu Jan 24 17:20:55 GMT 2013
Hi, and thanks for putting up a patch on Git so quickly. I'm sorry to say
that I ran into another problem when I patched our server with the new
perl_lib/EPrints/MetaField/Name.pm. Previously, the regular expression
that splits up initials came after the test for whether we're doing a
simple search (as opposed to an advanced search). In the patched version
it runs before that test. This is the new version of the code I'm talking
about:
    # split up initials
    $v2 =~ s/([\p{Uppercase}])/ $1/g;

    # name searches are case sensitive
    $v2 = "\L$v2";

    if( $search_mode eq "simple" )
    {
        return EPrints::Search::Condition->new(
            $indexmode,
            $dataset,
            $self,
            $v2 );
    }
Now, if I do a simple search for e.g. "James", the splitting up of
initials above inserts a space before the capital "J", so the search is
performed for " James" (" james" once lowercased), which doesn't work so
well. I'm not entirely sure what the intention of all of the code is, so
I don't have a fix for this myself yet.
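To make the effect easy to reproduce outside EPrints, here is a minimal
standalone sketch that applies the same two statements to a search term.
The sub name split_initials is made up for illustration and is not part
of EPrints. The script only shows what the value looks like by the time
it would be handed to EPrints::Search::Condition->new.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of EPrints): applies the same two
# statements as the patched Name.pm to a search term.
sub split_initials {
    my( $v2 ) = @_;

    # split up initials: a space goes in front of every uppercase letter
    $v2 =~ s/([\p{Uppercase}])/ $1/g;

    # lowercase the whole term
    $v2 = "\L$v2";

    return $v2;
}

# Brackets make the leading space visible.
foreach my $term ( "James", "JR" )
{
    printf "%-5s -> [%s]\n", $term, split_initials( $term );
}
# James -> [ james]
# JR    -> [ j r]
```

The second term shows what the substitution looks designed for (turning
"JR" into separate initials), while the first shows the side effect on an
ordinary capitalised name.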
There was another, unrelated, issue I came across while debugging. In the
table eprint__rindex, I noticed that some of the non-ASCII characters in
creators_name are stored correctly - e.g. "zenginoğlu". But then there are
some authors whose names don't come through right. For example, when I
entered a new paper written by "Magó", the creators_name is stored as
"mago" in eprint__rindex.word. Another example I found is "Eötvös", which
is stored as "eoetvoes". I haven't looked into this one in detail myself
yet, so I don't have any pointers as to what the cause may be.
Anyway, the first search issue is more pressing for us, so if anyone on
the list has any ideas for a robust solution that would be great.
Regards
Tommy, Caltech
On 1/17/13 4:38 AM, "Tim Brody" <tdb2 at ecs.soton.ac.uk> wrote:
>On Thu, 17 Jan 2013 00:46:37 +0000, Tommy Ingulfsen
><tommy at library.caltech.edu> wrote:
>> I may have found a bug in EPrints 3.3.10. One of the authors in our
>> repository is Anıl Zenginoğlu (if the name doesn't come out right in
>> email, his homepage is http://www.tapir.caltech.edu/~anil/). Searching
>> for the surname works fine with the simple search, but with the advanced
>> search we don't get any results. I believe the problem is with line 230 in
>> perl_lib/EPrints/MetaField/Name.pm:
>>
>> # remove not a-z characters (except ,)
>> $v2 =~ s/[^a-z,]/ /ig;
>>
>> That code splits up "zenginoğlu" to "zengino lu". A possible solution may
>> be
>>
>> use utf8;
>> …
>> $v2 =~ s/[^\p{L},]/ /ig;
>> …
>>
>> Maybe someone with a strong encodings-fu can comment?
>
>Hi,
>
>I've written a fix here:
>https://github.com/eprints/eprints/issues/13
>
>--
>All the best,
>Tim.
>*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>*** Archive: http://www.eprints.org/tech.php/
>*** EPrints community wiki: http://wiki.eprints.org/