EPrints Technical Mailing List Archive

Message: #01491



[EP-tech] Re: international character search problem


Hi and thanks for putting up a patch on Git so quickly. I'm sorry to say
that I ran into another problem when I patched our server with the new
perl_lib/EPrints/MetaField/Name.pm. Previously, the regular expression
that splits up initials was located after the test for whether we're doing
a simple search (as opposed to an advanced search) - this is the new
version of the code I'm talking about:

	# split up initials
	$v2 =~ s/([\p{Uppercase}])/ $1/g;

	# name searches are case sensitive
	$v2 = "\L$v2";

	if( $search_mode eq "simple" )
	{
		return EPrints::Search::Condition->new(
			$indexmode,
			$dataset,
			$self, 
			$v2 );
	}

Now, if I do a simple search for e.g. "James", the splitting up of
initials above inserts a space in front of the capital "J", so the search
is performed for " James" (lower-cased to " james", with a leading space),
which doesn't work so well. I'm not entirely sure what the intention of
all of the code is, so I don't have a fix for this myself yet.
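
To make the effect easier to reproduce outside EPrints, here is a minimal
standalone sketch of the two substitutions quoted above. The helper name
normalise_name_value is hypothetical and not part of EPrints, and the
trimming step at the end is only an assumption about one possible
workaround; it has not been tested against EPrints itself.

```perl
use strict;
use warnings;
use utf8;

binmode( STDOUT, ":encoding(UTF-8)" );

# Hypothetical helper (not EPrints code): applies the same two steps
# as the snippet quoted above, in the same order.
sub normalise_name_value
{
	my( $v2 ) = @_;

	# split up initials: put a space before every uppercase letter
	$v2 =~ s/([\p{Uppercase}])/ $1/g;

	# lower-case the whole value
	$v2 = "\L$v2";

	return $v2;
}

my $term = normalise_name_value( "James" );
print "[$term]\n";    # the brackets make the leading space visible

# Assumed workaround, for illustration only: strip leading and
# trailing whitespace before the value is handed to the search.
( my $trimmed = $term ) =~ s/^\s+|\s+$//g;
print "[$trimmed]\n";
```

Run on its own, this shows the leading space that the simple search now
receives. Whether trimming is the right fix depends on what the initials
handling is meant to achieve for the advanced search, which I can't tell
from the code.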

There was another, unrelated, issue I came across while debugging. In the
table eprint__rindex, I noticed that some of the non-ASCII characters in
creators_name are stored correctly - e.g. "zenginoğlu". But then there are
some authors whose names don't come through right. For example, when I
entered a new paper written by "Magó", the creators_name is stored as
"mago" in eprint__rindex.word. Another example I found is "Eötvös", which
is stored as "eoetvoes". I haven't looked into this one in detail myself
yet, so I don't have any pointers as to what the cause may be.

Anyway, the first search issue is more pressing for us, so if anyone on
the list has any ideas for a robust solution that would be great.

Regards
Tommy, Caltech



On 1/17/13 4:38 AM, "Tim Brody" <tdb2@ecs.soton.ac.uk> wrote:

>On Thu, 17 Jan 2013 00:46:37 +0000, Tommy Ingulfsen
><tommy@library.caltech.edu> wrote:
>> I may have found a bug in EPrints 3.3.10. One of the authors in our
>> repository is Anıl Zenginoğlu (if the name doesn't come out right in
>> email, his homepage is  http://www.tapir.caltech.edu/~anil/). Searching
>> for the surname works fine with the simple search, but with the advanced
>> search we don't get any results. I believe the problem is with line 230 in
>> perl_lib/EPrints/MetaField/Name.pm:
>> 
>> # remove not a-z characters (except ,)
>> $v2 =~ s/[^a-z,]/ /ig;
>> 
>> That code splits up "zenginoğlu" to "zengino lu". A possible solution may
>> be
>> 
>> use utf8;
>> …
>> $v2 =~ s/[^\p{L},]/ /ig;
>> …
>> 
>> Maybe someone with a strong encodings-fu can comment?
>
>Hi,
>
>I've written a fix here:
>https://github.com/eprints/eprints/issues/13
>
>-- 
>All the best,
>Tim.
>*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>*** Archive: http://www.eprints.org/tech.php/
>*** EPrints community wiki: http://wiki.eprints.org/