EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #09688



Re: [EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)


CAUTION: This e-mail originated outside the University of Southampton.

Dear all,

Here is a follow-up:

 

After two full days of debugging, trying out many variants, and getting more gray hair, we think the problem lies in how the hash $EPrints::Index::FREETEXT_CHAR_MAPPING in Index/Tokenizer.pm is accessed.

The lookup behaves completely erratically: sometimes š is translated to s, sometimes not. It is as if the hash sometimes did not exist.

The problem occurs with characters whose Unicode code point is above 0x00ff (that is, characters beyond Latin-1, not merely non-ASCII ones).

A “use 5.8.0” might remedy this by falling back to the old Unicode implementation of Perl, but we have not tried it.
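To illustrate how such a lookup can miss, here is a minimal sketch. It shows only one possible cause and is not a confirmed diagnosis of the Tokenizer problem. It assumes a hypothetical one-entry mapping keyed by the decoded character š (U+0161) and compares a lookup with the decoded character against a lookup with its UTF-8 encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical one-entry mapping (not the real FREETEXT_CHAR_MAPPING),
# keyed by the decoded character U+0161 (s with caron).
my %mapping = ( "\x{0161}" => "s" );

my $char  = "\x{0161}";                  # one character
my $bytes = encode( "UTF-8", $char );    # two bytes: 0xC5 0xA1

# The same letter finds its entry only if it arrives in the same form
# (decoded characters vs. encoded bytes) as the hash key.
print exists $mapping{$char}  ? "char key: hit\n" : "char key: miss\n";
print exists $mapping{$bytes} ? "byte key: hit\n" : "byte key: miss\n";
```

If the text reaching the tokenizer were sometimes decoded and sometimes not, a mapping like this would appear to work only some of the time.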

 

However, we have now applied a solution that we already use in cfg.d/optional_filename_sanitise.pl to transliterate file names and in several import plugins, and which is much simpler and more robust: Text::Unidecode.

 

This library splits the code point of each non-ASCII character into its high and low byte and then indexes the transliteration tables, which are arrays, not hashes, by the integer values of these two bytes.
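The table lookup described above can be sketched as follows. This is a simplified illustration, not the library's actual code: the table is a hypothetical miniature with three entries, and the names lookup_sketch and transliterate_sketch are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical miniature table: $table[HIGH][LOW] holds the ASCII
# replacement for the code point (HIGH << 8) | LOW.
my @table;
$table[0x00][0xFC] = "u";    # U+00FC, u with diaeresis
$table[0x01][0x60] = "S";    # U+0160, S with caron
$table[0x01][0x61] = "s";    # U+0161, s with caron

sub lookup_sketch {
    my ($char) = @_;
    my $cp  = ord $char;
    my $row = $table[ $cp >> 8 ];        # select the block by the high byte
    return $char unless defined $row;    # unknown block: leave unchanged
    my $ascii = $row->[ $cp & 0xFF ];    # select the entry by the low byte
    return defined $ascii ? $ascii : $char;
}

sub transliterate_sketch {
    my ($text) = @_;
    # Replace every non-ASCII character by its table entry, if there is one.
    $text =~ s/([^\x00-\x7F])/lookup_sketch($1)/ge;
    return $text;
}

print transliterate_sketch("Bzdu\x{0161}ek"), "\n";    # prints "Bzdusek"
```

Because the lookup uses integer array indices derived from the code point, it does not depend on how a hash key happens to be encoded.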

Since the transliteration tables are very extensive, maintaining $EPrints::Index::FREETEXT_CHAR_MAPPING is not necessary at all.

Also, it is possible to override the Text::Unidecode transliteration tables if one needs to. See https://metacpan.org/pod/Text::Unidecode
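For reference, a minimal usage sketch of Text::Unidecode. The helper name ascii_term_sketch is hypothetical and not part of EPrints, and the way it would be wired into the tokenizer is left open here. The module is loaded at run time so that the sketch also runs where Text::Unidecode is not installed.

```perl
use strict;
use warnings;

# Text::Unidecode is a CPAN module, not a core module; load it at run time.
my $have_unidecode = eval { require Text::Unidecode; 1 };

# Hypothetical helper (not part of EPrints): reduce a term to plain ASCII.
sub ascii_term_sketch {
    my ($term) = @_;
    return Text::Unidecode::unidecode($term);
}

if ($have_unidecode) {
    print ascii_term_sketch("Bzdu\x{0161}ek"), "\n";    # prints "Bzdusek"
}
else {
    print "Text::Unidecode is not installed\n";
}
```

If the same reduction is applied to index terms and to search terms, Bzdušek and Bzdusek end up as the same word.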

I also see that it was part of the EPrints 3.3 package (but was removed in EPrints 3.4).

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Martin Brändle <martin.braendle@uzh.ch>
Date: Monday, 19 February 2024 at 13:16
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>
Subject: [EP-tech] Index/Tokenizer problem (RHEL 8, perl 5.26)

Dear all,

 

We have detected an indexing problem with perl_lib/EPrints/Index/Tokenizer.pm

 

Characters beyond Latin-1 (Unicode code point > 0x00ff) are not translated correctly when the words for the reverse index are created, although they are listed in the $EPrints::Index::FREETEXT_CHAR_MAPPING map.

 

The reverse index (eprint__rindex) for one of the author names containing a special character is now a mixture of both versions, e.g. Bzdušek vs. Bzdusek. If we reindex one of the older records, its reverse index entry is reverted from Bzdusek to Bzdušek.

 

If we search with Bzdušek, the records are not found.

 

We assume that this problem has existed since we upgraded to RHEL 8 and Perl 5.26.3.

 

BTW: The Tokenizer code for EPrints 3.3 and EPrints 3.4 is quite different: 

https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm

https://github.com/eprints/eprints3.4/blob/master/perl_lib/EPrints/Index/Tokenizer.pm

 

We have tried both versions, to no avail. 

 

Have others observed similar problems with perl 5.26 or higher? As far as I have seen from perl documentation, Unicode support has changed (e.g. :encoding has been deprecated and removed).

 

Kind regards,

 

Martin

 

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

mail: martin.braendle@uzh.ch
phone: +41 44 63 56705
https://orcid.org/0000-0002-7752-6567
https://www.zi.uzh.ch