EPrints Technical Mailing List Archive

Message: #06900


Re: [EP-tech] Problem with searching for names starting with Ö



Querying “with the wrong keyboard” has always been an issue when non-English characters are involved. I agree that the modern approach of simply dropping accents and everything outside 7-bit ASCII solves most problems, since otherwise you need to know how the letter sounds, which is far from obvious.



For example, we translate the Swedish characters:


“å” and “ä” both to “a”

“ö” to “o”


For combined characters/ligatures like the Danish “æ”, we substitute both characters, in this case “ae”, as we feel this approach is the most intuitive.


We get a few false positives in the querying but this is very seldom an issue. 
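The substitutions described above can be sketched roughly as follows. This is only an illustration in Python, not the actual change made to Tokenizer.pm (which is Perl); the names `ACCENT_MAP` and `fold` are hypothetical, and the table covers only the characters mentioned in this message.

```python
# Hypothetical folding table: only the characters named in this message.
# Accented letters lose their accent; the ligature expands to two letters.
ACCENT_MAP = str.maketrans({
    "å": "a", "ä": "a", "ö": "o",
    "Å": "A", "Ä": "A", "Ö": "O",
    "æ": "ae", "Æ": "AE",
})


def fold(term):
    """Return the folded, lower-cased form used for both indexing and querying."""
    return term.translate(ACCENT_MAP).lower()


def matches(indexed_name, query):
    """A name matches a query when both fold to the same string."""
    return fold(indexed_name) == fold(query)


if __name__ == "__main__":
    print(fold("Öl"))                 # accent dropped
    print(fold("Kjær"))               # ligature expanded
    print(matches("Öl", "Ol"))        # typed without the accent
    print(matches("Håkan", "Hakan"))  # two different names fold together
```

Because the same folding is applied at index time and at query time, a search typed without accents finds the accented name. The last call shows where the occasional false positive comes from: distinct names that differ only by an accent end up with the same folded form.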


I have simply changed Tokenizer.pm (and keep a copy in our re-installation routines to replace it whenever we set up a new machine or the like). I don’t know of any other way.




From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Liam Green-Hughes
Sent: 24 October 2017 15:28
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Problem with searching for names starting with Ö


Hi everyone,


We've run into an issue with searching for names containing certain characters and how they are handled by the Tokenizer.pm (https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm) module. I notice in the FREETEXT_CHAR_MAPPING that characters are being substituted when indexing takes place or search terms are entered. Many of the substitutions make sense, but some others seem to be done on a phonetic basis? Strangely, this isn't an issue on the simple search form, but if a name is entered in the "Creator" field of the advanced search some strange things can happen.


For example (btw names have been changed!) if an author exists on the system with the surname "Öl", results will not be returned if I search by "Ol" but they will be if I enter "Öl" or, more surprisingly, "Oel" (thanks to the substitution made).
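To make the behaviour concrete, here is a small Python sketch of a phonetic-style substitution. It assumes that "Ö" is mapped to "OE", which is inferred only from the results described above and is not copied from FREETEXT_CHAR_MAPPING; the names `PHONETIC_MAP`, `index_key` and `matches` are hypothetical.

```python
# Assumed mapping, inferred from the observed search results only.
PHONETIC_MAP = {"Ö": "OE", "ö": "oe"}


def index_key(term):
    """Apply the substitution character by character, then lower-case."""
    mapped = "".join(PHONETIC_MAP.get(ch, ch) for ch in term)
    return mapped.lower()


def matches(indexed_name, query):
    """The same substitution runs on the indexed name and on the search term."""
    return index_key(indexed_name) == index_key(query)


if __name__ == "__main__":
    for query in ("Öl", "Oel", "Ol"):
        print(query, matches("Öl", query))
```

Under this assumption "Öl" is stored as "oel", so the queries "Öl" and "Oel" both reduce to the stored form, while "Ol" stays "ol" and finds nothing.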


I understand that in many languages letters such as these are considered to be entirely different characters, but when people search using an English language keyboard they tend to just drop the accents. This has led to a situation where results were not returned in an expected manner. 


Has anyone else encountered this problem? I can change the behaviour by changing the mappings in Tokenizer.pm but that means modifying core code. It also doesn't look to be easily overridable?


Am very interested to hear any thoughts about how to approach this!





Library Systems Developer

University of Kent