EPrints Technical Mailing List Archive

Message: #04298

[EP-tech] Re: Normalize characters for correct sorting


Ah, OK. Yes, I had a similar problem a few years ago.

It looks like http://search.cpan.org/~kiz/MathML-Entities-Approximate-0.20/lib/MathML/Entities/Approximate.pm should be updated, and it could be used by the Tokenizer :)


On 09/06/15 09:59, pgasinos pgs wrote:
Hi Ian

I probably didn't make the real problem clear. In English you don't have
the same vowel with and without an accent. In Greek the accent is only a
matter of correct spelling, so it is the same letter and has to be
normalized to be sorted correctly. As you can see, Tokenizer.pm
(/perl_lib/EPrints/Index/Tokenizer.pm) does the same for indexing.

Kostas

2015-06-09 10:57 GMT+03:00 Ian Stuart <Ian.Stuart@ed.ac.uk>:

    I suspect this is a Perl problem rather than an EPrints problem. I
    would expect Perl to sort by Unicode code point (so U+0386 before U+0391).
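
To illustrate the code point ordering described above, here is a minimal sketch. It is written in Python purely for illustration (it is not EPrints or Perl code), and the three sample words are made up for the example. Python's default string sort also compares code points, so it shows the same effect: a word starting with U+0386 lands ahead of every word starting with U+0391.

```python
import unicodedata

# Hypothetical sample titles. NFC is applied so the accented capital is the
# single precomposed character U+0386, however this file was saved.
words = [unicodedata.normalize("NFC", w) for w in ("Αθήνα", "Άλφα", "Αλφάβητο")]

# Show the code point of each first letter.
for w in words:
    print(f"U+{ord(w[0]):04X}  {w}")

# The default sort compares code points, so U+0386 sorts before U+0391
# and the accented word is separated from its unaccented neighbours.
by_code_point = sorted(words)
print(by_code_point)
```

A Greek reader would expect "Άλφα" to sit between the other two words rather than in front of both.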

    On 09/06/15 08:40, pgasinos pgs wrote:
     > Is there any configuration file in EPrints that can normalize
     > UTF-8 characters so that they sort correctly in non-English
     > languages?
     > For example, the Unicode characters &#x0386; GREEK CAPITAL LETTER ALPHA
     > WITH TONOS and &#x0391; GREEK CAPITAL LETTER ALPHA are the same letter
     > and have to be sorted together, not in separate lists.
     > The vowels are even more complicated. All of the following are the
     > same letter and have to be in the same list:
     > υ    &#965;  GREEK SMALL LETTER UPSILON
     > ύ    &#973;  GREEK SMALL LETTER UPSILON WITH TONOS
     > ϋ    &#971;  GREEK SMALL LETTER UPSILON WITH DIALYTIKA
     > ΰ    &#944;  GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
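
The normalization asked for above can be sketched as follows. This is a hedged illustration in Python, not the EPrints Tokenizer or any EPrints configuration option. The helper name `fold` and the sample titles are assumptions made for the example. The approach is to decompose each character (NFD), drop the combining marks (tonos and dialytika), and use the result only as a sort key, so the stored titles stay unchanged.

```python
import unicodedata

def fold(text: str) -> str:
    """Return a case-folded copy of text with combining marks removed."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for marks such as tonos (U+0301) and dialytika (U+0308).
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", stripped).casefold()

# The four upsilon variants from the question: υ ύ ϋ ΰ
upsilons = ["\u03c5", "\u03cd", "\u03cb", "\u03b0"]
folded = {fold(u) for u in upsilons}
print(folded)  # all four collapse to plain upsilon

# Hypothetical titles, sorted with the folded form as the key.
titles = [unicodedata.normalize("NFC", t) for t in ("Άλφα", "Αλφάβητο", "Αθήνα")]
ordered = sorted(titles, key=fold)
print(ordered)
```

Treating the dialytika as ignorable follows the request in the question. A full solution would more likely use a locale-aware collation library, which this sketch does not attempt.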


--

Ian Stuart.
Developer: ORI, RJ-Broker, and OpenDepot.org
Bibliographics and Multimedia Service Delivery team,
EDINA,
The University of Edinburgh.

http://edina.ac.uk/
