[EP-tech] Re: Normalize characters for correct sorting
Ian Stuart
Ian.Stuart at ed.ac.uk
Tue Jun 9 11:34:51 BST 2015
Ah - OK.... yes, I had a similar problem a few years ago
It looks like
http://search.cpan.org/~kiz/MathML-Entities-Approximate-0.20/lib/MathML/Entities/Approximate.pm
should be updated, and it could be used by the Tokenizer :)
On 09/06/15 09:59, pgasinos pgs wrote:
> Hi Ian
>
> I probably didn't make clear what the real problem is. Unlike English,
> Greek has the same vowel with and without an accent. The accent is only
> a matter of correct spelling, so it is the same letter and has to be
> normalized to be sorted correctly. If you look at Tokenizer.pm
> (/perl_lib/EPrints/Index/Tokenizer.pm), it does the same for indexing.
>
> Kostas
>
> 2015-06-09 10:57 GMT+03:00 Ian Stuart <Ian.Stuart at ed.ac.uk>:
>
> I suspect this is a Perl problem rather than an EPrints problem. I
> would expect Perl to sort by Unicode code point (so U+0386 before U+0391)
>
> On 09/06/15 08:40, pgasinos pgs wrote:
> > Is there any configuration file in EPrints where one can normalize
> > UTF-8 characters so that they sort correctly in non-English languages?
> > For example, the Unicode characters Ά (U+0386, GREEK CAPITAL LETTER
> > ALPHA WITH TONOS) and Α (U+0391, GREEK CAPITAL LETTER ALPHA) are the
> > same letter and have to be sorted together, not in separate lists.
> > The vowels are even more complicated. All of the following are the
> > same letter and have to be in the same list:
> > υ (U+03C5) GREEK SMALL LETTER UPSILON
> > ύ (U+03CD) GREEK SMALL LETTER UPSILON WITH TONOS
> > ϋ (U+03CB) GREEK SMALL LETTER UPSILON WITH DIALYTIKA
> > ΰ (U+03B0) GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
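
A minimal sketch of the normalization being discussed, written in Python purely for illustration (EPrints itself is Perl, where the core Unicode::Normalize module provides a comparable NFD function). It is not the Tokenizer.pm code. The helper name strip_accents and the sample place names are assumptions made up for this example. The idea is to decompose each string, drop the combining marks, and use the result only as a sort key, so the stored value keeps its accents.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: drop combining marks after canonical decomposition."""
    # NFD splits e.g. U+0386 into U+0391 followed by a combining accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sample data, assumed for illustration only.
names = ["Αθήνα", "Άρτα", "Βόλος", "Έδεσσα", "Αίγιο"]

# Plain sort compares code points, so accented capitals end up in their own group.
by_code_point = sorted(names)

# Accent-insensitive sort: compare the stripped, case-folded form instead.
by_base_letter = sorted(names, key=lambda s: strip_accents(s).casefold())

print(by_code_point)
print(by_base_letter)
```

With this key, all four upsilon variants listed above reduce to the plain upsilon, and Ά sorts alongside Α. A locale-aware collation library would be the more complete answer; this sketch only covers the accent-stripping step.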
--
Ian Stuart.
Developer: ORI, RJ-Broker, and OpenDepot.org
Bibliographics and Multimedia Service Delivery team,
EDINA,
The University of Edinburgh.
http://edina.ac.uk/