EPrints Technical Mailing List Archive


Message: #09629



[EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)



Dear all,

 

We have detected an indexing problem in perl_lib/EPrints/Index/Tokenizer.pm.

 

Characters outside the Latin-1 range (Unicode code point > 0x00FF) are not translated correctly when the words for the reverse index are created, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.
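
For illustration, here is a minimal sketch of the behaviour we expect from the character mapping. It is written in Python rather than Perl purely so that it is easy to run, and it is not the EPrints code: CHAR_MAPPING is a hypothetical three-entry stand-in for $EPrints::Index::FREETEXT_CHAR_MAPPING, and both helper functions are invented for this example. The second half shows one way such a lookup can miss, namely when the UTF-8 bytes of a string were never decoded into characters. Whether that is the cause of the problem described here is not established.

```python
# Hypothetical stand-in for $EPrints::Index::FREETEXT_CHAR_MAPPING.
# The real table lives in the EPrints Perl code and is much larger.
CHAR_MAPPING = {"š": "s", "ä": "a", "ü": "u"}


def apply_mapping(text: str) -> str:
    """Replace every character that has an entry in CHAR_MAPPING."""
    return "".join(CHAR_MAPPING.get(ch, ch) for ch in text)


def as_undecoded_bytes(text: str) -> str:
    """Simulate a string whose UTF-8 bytes were never decoded:
    each byte is treated as one (Latin-1) character."""
    return text.encode("utf-8").decode("latin-1")


decoded = "Bzdušek"
undecoded = as_undecoded_bytes(decoded)

# "š" is U+0161, i.e. above 0x00FF; "ä" is U+00E4, i.e. below it.
print(hex(ord("š")), hex(ord("ä")))

# Properly decoded input: "š" is a single character and is mapped.
mapped_decoded = apply_mapping(decoded)
print(mapped_decoded)

# Undecoded input: "š" appears as two byte-sized characters,
# neither of which is a key in CHAR_MAPPING, so nothing is replaced.
mapped_undecoded = apply_mapping(undecoded)
print(len(decoded), len(undecoded))
print(mapped_undecoded == undecoded)
```

In this sketch the decoded name is mapped to the ASCII form, while the undecoded variant passes through the mapping unchanged, which is the kind of mismatch that would leave two different spellings in an index.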

 

The reverse index (eprint__rindex) for an author name containing such a character now holds a mixture of both forms, e.g. Bzdušek and Bzdusek. If we reindex one of the older records, its reverse index entry changes from Bzdusek back to Bzdušek.

 

If we search with Bzdušek, the records are not found.

 

We assume that the problem has existed since we upgraded to RHEL 8 and Perl 5.26.3.

 

BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different: 

https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm

https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm

 

We have tried both versions, to no avail. 

 

Have others observed similar problems with Perl 5.26 or higher? As far as I can see from the Perl documentation, Unicode support has changed (e.g. the encoding pragma has been deprecated and its default mode removed).

 

Kind regards,

 

Martin

 

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

mail: martin.braendle@uzh.ch
phone: +41 44 63 56705
https://orcid.org/0000-0002-7752-6567
https://www.zi.uzh.ch