Tech List


eprints_tech messages

Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request

From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Mon, 12 May 2008 13:52:53 +0200


*** 
http://www.eprints.org/tech.php/id/%3Cea0115e90805120452t1b21fd5ct1bd9b6ee6d022499%40mail.gmail.com%3E
*** EPrints community wiki - http://wiki.eprints.org/

On Mon, May 12, 2008 at 12:54 PM, Tim Brody <tdb01r AT ecs.soton.ac.uk> wrote:
>  Roman Chyla wrote:
>
> > Thank you Tim,
> >
> > but how can the community live without Unicode? How can they search
> > for Unicode strings? It is very expensive to use our own sorting
> > routines when the database can do it faster and better. I cannot do
> > without Unicode, and I suppose hundreds of thousands of sites out
> > there cannot either. Even if we can provide mappings for metadata
> > fields, we cannot deal with all the possible variations coming from
> > the full text - that is a losing battle.
> >
> > My EPrints installation is working fine with Unicode, but indexing is
> > stripping out Unicode strings (searching works well). I guess I am on
> > my own here to fix it...
> >
> > Please register this as a serious feature request - storing Unicode
> > strings as latin1 is not the same as having full Unicode support. And
> > it is so easy to switch to Unicode; it will not cost anything
> > compared to the benefits.
> >
> >
>  What are you trying to do that EPrints doesn't do?

I discovered this because searching does not work for accented
characters. Or, more precisely, it works only sometimes - e.g. a search
for "Čefelín" succeeds but one for "čefelín" fails. That is the name of
an author, and his name was encoded (as binary) but obviously not
converted to lower case.
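
A minimal Python sketch of why a byte-wise comparison can miss "čefelín" while a Unicode-aware one finds it. It only illustrates the general effect and is not the EPrints code; the function names are made up for this example.

```python
def bytes_match(stored: str, query: str) -> bool:
    """Compare as UTF-8 bytes; bytes.lower() only folds ASCII letters."""
    return stored.encode("utf-8").lower() == query.encode("utf-8").lower()


def unicode_match(stored: str, query: str) -> bool:
    """Compare as Unicode text; str.casefold() folds accented letters too."""
    return stored.casefold() == query.casefold()


name = "Čefelín"
print(bytes_match(name, "Čefelín"))    # same case: the bytes are identical
print(bytes_match(name, "čefelín"))    # lower-case query: the bytes differ
print(unicode_match(name, "čefelín"))  # case-insensitive on Unicode text
```

The byte-wise variant behaves like a binary column: "Č" and "č" are different byte sequences, and nothing in the comparison knows they are the same letter.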

Then, issue 2) - some accented characters are stripped out completely
before they make their way into the index. I suspect it is an issue
with collation and regexes in Perl and my environment, but I haven't
figured this out yet.
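
One way such stripping can happen, sketched in Python rather than Perl: if the tokenizer's notion of a "word character" is ASCII-only, accented letters act as separators and vanish from the index terms. This is an assumption about the kind of failure involved, not a diagnosis of the EPrints indexer; both function names are hypothetical.

```python
import re


def tokenize_ascii(text: str) -> list[str]:
    """Split into words where \\w matches only [a-zA-Z0-9_]."""
    return re.findall(r"\w+", text, flags=re.ASCII)


def tokenize_unicode(text: str) -> list[str]:
    """Split into words where \\w matches any Unicode letter or digit."""
    return re.findall(r"\w+", text)


print(tokenize_ascii("Čefelín"))    # ['efel', 'n']
print(tokenize_unicode("Čefelín"))  # ['Čefelín']
```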

That was how I found out that EPrints works with Unicode internally,
but its database does not work with Unicode at all.

>
>  Internationalisation and localisation are handled internally by EPrints.
> Strictly the database is being asked to store data as binary, rather than
> "latin-1".

Right, sorry - but then no collation works and you lose other things
such as string functions. Not that upper(name) is that important, but
the collation stuff is.

>
>  I suspect indexing is always going to be EPrints-specific, because you will
> want to expand something like:
>  Völker to {Völker,Volker,Voelker}

I agree, it is great to have it. But it would be better to have it
together with Unicode. I can think of many languages, and there is no
way for a mere mortal to prepare mappings for all of them. Maybe I got
some configuration wrong (it is a new server here), but that should
not be an issue if the system supports UTF-8. Are there any
installations of EPrints in Russian, Arabic or Chinese? Does
searching work for them? I doubt it.
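
The expansion Tim mentions (Völker to {Völker,Volker,Voelker}) could be sketched as follows. This is an illustration only, not how EPrints does it: the TRANSLIT table is a hypothetical German-style mapping, and it shows the problem raised above, because every language would need its own table.

```python
import unicodedata

# Hypothetical German-style transliterations; other languages need their own.
TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue",
            "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def strip_accents(word: str) -> str:
    """Decompose to NFD and drop the combining marks (ö -> o)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate(word: str) -> str:
    """Replace letters found in TRANSLIT (ö -> oe); leave the rest alone."""
    return "".join(TRANSLIT.get(ch, ch) for ch in word)


def index_variants(word: str) -> list[str]:
    """Return the distinct spellings to index, original first."""
    word = unicodedata.normalize("NFC", word)  # ensure precomposed letters
    variants = [word, strip_accents(word), transliterate(word)]
    return list(dict.fromkeys(variants))  # de-duplicate, keep order


print(index_variants("Völker"))   # ['Völker', 'Volker', 'Voelker']
print(index_variants("Čefelín"))  # ['Čefelín', 'Cefelin']
```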

>
>  At the moment the ordervalues_* tables are used by searches. You could
> change their character set to utf-8 and the collation to the appropriate
> language-specific collation. But the ordering on views is handled internally
> by EPrints.

I have changed the whole database and converted the tables, I issue
"set names utf8" when the connection begins, and I can query the
database using standard tools; I can write to it from external
applications too. I can't think of any reason why it should not work
like this. I only need to fix the indexer now: it is dropping accented
characters (but that is an issue with the Perl script). I have to find
it.
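
For reference, a small Python sketch that builds the kind of MySQL statements described above: "SET NAMES utf8" for the connection, then a character set and collation change per table. The helper name and the table name are hypothetical (the thread only says ordervalues_*), and utf8_czech_ci is just one example of a language-specific collation. Statements like these are best tried on a copy first, since converting columns that already hold UTF-8 bytes under another declared character set may garble the data.

```python
import re


def utf8_setup_statements(tables, collation="utf8_general_ci"):
    """Return SQL strings that switch a connection and tables to UTF-8."""
    statements = ["SET NAMES utf8"]
    for table in tables:
        # Accept only plain identifiers so the generated SQL stays safe.
        if not re.fullmatch(r"[A-Za-z0-9_]+", table):
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(
            f"ALTER TABLE `{table}` "
            f"CONVERT TO CHARACTER SET utf8 COLLATE {collation}"
        )
    return statements


# Hypothetical table name, for illustration only.
for sql in utf8_setup_statements(["eprint__ordervalues_en"], "utf8_czech_ci"):
    print(sql)
```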

>
>  Doing something a bit smarter using the database collations may be possible
> with 3.1.

Is it there already? I'd better upgrade.

Cheers,

  roman

>
>  Cheers,
>  Tim.
>
>

