eprints_tech messages

[EP-tech] either me or Eprints is missing on utf8 - bug/feature request

From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Sun, 11 May 2008 00:41:59 +0200



*** http://www.eprints.org/tech.php/id/%3Cea0115e90805101541v542d2c1o3ad4f731703bfba9%40mail.gmail.com%3E
*** EPrints community wiki - http://wiki.eprints.org/

Hello,
Excuse my premature senility, but some (serious) things are not clear
to me. I have successfully converted my database to utf8, fighting
with several issues along the way, and (possibly) found a bug.

Firstly, one cannot set a default collation for the database like this:
Alter database eprints3 character set utf8 collate utf8_czech_ci;

because this happens:
DBD::mysql::st execute failed: Illegal mix of collations
(utf8_general_ci,IMPLICIT) and (utf8_czech_ci,IMPLICIT) for operation
'=' at /opt/eprints3/perl_lib/EPrints/Database.pm line 2363.
SQL ERROR (execute): SELECT M.subjectid, M.pos, M.ancestors, C.pos
FROM cache5960 AS C, subject_ancestors AS M WHERE M.subjectid =
C.subjectid AND C.pos>0 ORDER BY C.pos
SQL ERROR (execute): Illegal mix of collations
(utf8_general_ci,IMPLICIT) and (utf8_czech_ci,IMPLICIT) for operation
'='
DBD::mysql::st fetchrow_array failed: fetch() without execute() at
/opt/eprints3/perl_lib/EPrints/Database.pm line 2073.
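
One plausible reading of this error, offered as an assumption rather than a confirmed diagnosis: ALTER DATABASE changes only the database default, which applies to tables created afterwards. A freshly created cache table would then use utf8_czech_ci while an existing table such as subject_ancestors keeps utf8_general_ci. The Python sketch below only builds the ALTER TABLE statements as text and does not execute anything; the helper name and the table list are hypothetical. Note that CONVERT TO re-encodes column data, so it is only appropriate once the data is genuinely stored as utf8.

```python
def build_convert_statements(tables, charset="utf8", collation="utf8_czech_ci"):
    """Return one ALTER TABLE ... CONVERT TO statement per table name."""
    return [
        f"ALTER TABLE `{name}` CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for name in tables
    ]


# Hypothetical table list; in practice it would come from SHOW TABLES.
statements = build_convert_statements(["subject_ancestors"])
for statement in statements:
    print(statement)
```

Reviewing the generated statements before running them by hand keeps the change visible and reversible.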


I have also converted all my data to utf8 (and I am sure it is
correct in the database).
But EPrints then starts to complain about a wrong encoding.

I can fix it by adding
$self->do('SET NAMES utf8');

to Database.pm, where the instance is created. Then everything is fine.

But this should not be necessary (?) Am I missing something? Or are
all the EPrints archives storing utf8 as latin1 internally in their
databases? (And, as somebody reported, proper sorting does not work.)
Should I install a newer version of EPrints? Please give me some reasonable
answers, it can't be EPrints, it must be me...
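
To make the "utf8 stored as latin1" question concrete, here is a small Python sketch of what happens when UTF-8 bytes pass through a channel that treats every byte as one latin1 character. It uses Python's latin-1 codec as a stand-in for the MySQL latin1 character set, which is an assumption (the two differ for a few byte values), and it does not touch a database.

```python
original = "Völker"

# Bytes as a UTF-8 aware application would send them.
utf8_bytes = original.encode("utf-8")

# What a latin1 channel believes those bytes mean: one character per byte.
seen_as_latin1 = utf8_bytes.decode("latin-1")

print(len(original), len(seen_as_latin1))  # the character counts differ
print(seen_as_latin1 == original)          # False: the text looks garbled

# Reading the bytes back through the same latin1 channel restores the text,
# which is why the mismatch can go unnoticed inside a single application.
round_trip = seen_as_latin1.encode("latin-1").decode("utf-8")
print(round_trip == original)
```

The data survives the round trip, but anything the database does per character (length, comparison, sorting) operates on the garbled form.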

Thanks,


roman


Here is the conversion how-to; I will eventually put it on the wiki
(depending on your answers).


#dump schema of the database
mysqldump --no-data --set-charset -u root -p<password> <db_name> > schema.sql

#dump the data; it will actually be utf8 encoded, don't be fooled by the charset latin1 bit
mysqldump --no-create-info --skip-set-charset -u root -p<yourpassword> --default-character-set=latin1 <db_name> > data.sql

#open schema.sql in an editor and:
- replace all occurrences of CHARSET=latin1 with CHARSET=utf8
- also change the default NULL charset for columns (see
http://bugs.mysql.com/bug.php?id=23073)
-- search for "varchar(255)" and replace it with "varchar(255) CHARACTER SET utf8"
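
The two schema edits above can also be scripted instead of done by hand. The following Python sketch applies them to a dump held in memory; the function name and the sample table are hypothetical, and the lookahead that skips columns already carrying an explicit character set is an addition not present in the manual steps.

```python
import re


def convert_schema(sql):
    """Apply the two schema edits described above to a dump held in memory."""
    sql = sql.replace("CHARSET=latin1", "CHARSET=utf8")
    # Skip columns that already carry an explicit character set.
    return re.sub(
        r"varchar\(255\)(?!\s+CHARACTER SET)",
        "varchar(255) CHARACTER SET utf8",
        sql,
    )


# Hypothetical two-column table, shaped like mysqldump output.
sample = (
    "CREATE TABLE `demo` (\n"
    "  `title` varchar(255) default NULL,\n"
    "  `name` varchar(255) CHARACTER SET utf8 default NULL\n"
    ") ENGINE=MyISAM DEFAULT CHARSET=latin1;\n"
)
converted = convert_schema(sample)
print(converted)
```

Running it on a copy of schema.sql and diffing the result against the original is a cheap way to check the edit before loading anything.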

#set the utf8 encoding for the data

on Linux you can do: echo 'SET NAMES utf8;' | cat - data.sql > datautf.sql

#now load the edited db schema (this will recreate the database, AND
DESTROY ALL THE DATA!!! - make sure you have it in datautf.sql)
mysql <db_name> -u root -p < schema.sql

#load the data
mysql <db_name> -u root -p < datautf.sql

----
Now you are done. If you want to set the default encoding for the
database (that is useful only for newly created tables, and it might be
better to set the charset globally, for the whole server), you can issue
alter database <db_name> character set utf8 collate <collation_name>;


Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request

From: Tim Brody <tdb01r AT ecs.soton.ac.uk>
Date: Mon, 12 May 2008 10:26:21 +0100



*** http://www.eprints.org/tech.php/id/%3C48280D3D.5060900%40ecs.soton.ac.uk%3E

EPrints doesn't expect the database to be in Unicode (or any other 
encoding).

The theory is that if you want a sorting other than in English you will 
write a custom method for your language and use it in the 
"make_value_orderkey" property on the fields that aren't in English.

This property is briefly documented at:
http://wiki.eprints.org/w/Metadata
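
As an illustration of what a language-specific order key can look like, here is a Python sketch for Czech. It is not the EPrints API (EPrints itself is written in Perl) and the function name is hypothetical; it only shows the idea of mapping a value to a key that sorts in the desired order. The alphabet table is simplified: it treats accented vowels as separate letters, which real Czech collation handles more subtly.

```python
# Simplified Czech alphabet; "ch" is a single letter that sorts after "h".
CZECH_ALPHABET = [
    "a", "á", "b", "c", "č", "d", "ď", "e", "é", "ě", "f", "g", "h", "ch",
    "i", "í", "j", "k", "l", "m", "n", "ň", "o", "ó", "p", "q", "r", "ř",
    "s", "š", "t", "ť", "u", "ú", "ů", "v", "w", "x", "y", "ý", "z", "ž",
]
RANK = {letter: index for index, letter in enumerate(CZECH_ALPHABET)}


def czech_orderkey(value):
    """Map a string to a tuple of letter ranks usable as a sort key."""
    text = value.lower()
    key = []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            letter, i = "ch", i + 2
        else:
            letter, i = text[i], i + 1
        # Unknown characters sort after every known letter.
        key.append(RANK.get(letter, len(CZECH_ALPHABET)))
    return tuple(key)


words = ["chata", "cihla", "čaj", "hrad", "dům"]
ordered = sorted(words, key=czech_orderkey)
print(ordered)
```

A plain byte-wise or English sort would place "chata" before "cihla" and "čaj" last, which is the kind of difference a custom order key is meant to correct.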

All the best,
Tim.



Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request

From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Mon, 12 May 2008 12:06:20 +0200



*** http://www.eprints.org/tech.php/id/%3Cea0115e90805120306kd7ea332ic8531c2d110bf695%40mail.gmail.com%3E

Thank you Tim,

but how can the community live without Unicode? How can they search
for Unicode strings? It is very expensive to use your own sorting routines
when the database can do it faster and better. I cannot do without
Unicode, and I suppose neither can hundreds of thousands of sites out there.
Even if we can provide mappings for metadata fields, we cannot deal with
all the possible variations coming from the full text - that is a
losing battle.

My EPrints installation is running fine with Unicode, but indexing is
stripping off Unicode strings (searching works well). I guess I am on
my own here to fix it...

Please register this as a serious feature request - storing Unicode
strings as latin1 is not the same as having full Unicode support. And
it is so easy to switch to Unicode; actually, it will cost next to
nothing compared to the benefits.

Best,

  roman



Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request

From: Tim Brody <tdb01r AT ecs.soton.ac.uk>
Date: Mon, 12 May 2008 11:54:47 +0100



*** http://www.eprints.org/tech.php/id/%3C482821F7.1000700%40ecs.soton.ac.uk%3E

What are you trying to do that EPrints doesn't do?

Internationalisation and localisation are handled internally by EPrints. 
Strictly the database is being asked to store data as binary, rather 
than "latin-1".

I suspect indexing is always going to be EPrints-specific, because you 
will want to expand something like:
Völker to {Völker,Volker,Voelker}
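
A minimal Python sketch of this kind of expansion, under stated assumptions: the function names are hypothetical, the transliteration table covers German only, and this is not how the EPrints indexer is implemented. It produces the original word, an accent-stripped form, and a transliterated form.

```python
import unicodedata

# Assumed German transliteration table; other languages need their own.
GERMAN = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def strip_accents(word):
    """Drop combining marks after decomposing the word (ö becomes o)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_variants(word):
    """Return the word plus its accent-stripped and transliterated forms."""
    word = unicodedata.normalize("NFC", word)
    variants = [word, strip_accents(word), word.translate(GERMAN)]
    # Remove duplicates while keeping the order stable.
    return list(dict.fromkeys(variants))


variants = expand_variants("Völker")
print(variants)
```

Indexing all variants lets a query typed without the umlaut still find the record, at the cost of a larger index.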

At the moment the ordervalues_* tables are used by searches. You could 
change their character set to utf-8 and the collation to the appropriate 
language-specific collation. But the ordering on views is handled 
internally by EPrints.

Doing something a bit smarter using the database collations may be 
possible with 3.1.

Cheers,
Tim.


Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request

From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Mon, 12 May 2008 13:52:53 +0200



*** http://www.eprints.org/tech.php/id/%3Cea0115e90805120452t1b21fd5ct1bd9b6ee6d022499%40mail.gmail.com%3E

On Mon, May 12, 2008 at 12:54 PM, Tim Brody <tdb01r AT ecs.soton.ac.uk> wrote:
>  What are you trying to do that EPrints doesn't do?

I discovered this because searching does not work for accented
characters. Or more precisely, it works only sometimes: e.g. a search for
"Čefelín" succeeds but one for "čefelín" fails. That is the name of an
author; his name was encoded (as binary) but obviously not converted to
lower case.
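
The symptom can be reproduced outside EPrints. The Python sketch below contrasts byte-wise lower-casing, which is all that can be done with data treated as binary, with character-aware lower-casing. It is an illustration of the effect only, not a trace of what EPrints or MySQL actually executes.

```python
import unicodedata

# Precomposed (NFC) forms, as most input methods produce them.
stored = unicodedata.normalize("NFC", "Čefelín")
query = unicodedata.normalize("NFC", "čefelín")

# Byte-wise lower-casing, as applied to binary data: only ASCII A-Z change.
bytes_match = stored.encode("utf-8").lower() == query.encode("utf-8").lower()

# Character-aware lower-casing knows that Č lower-cases to č.
text_match = stored.lower() == query.lower()

print(bytes_match, text_match)
```

The first comparison fails and the second succeeds, which matches a search that works for the capitalised name but not for the lower-case one.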

Then, issue 2) - some accented characters are stripped off completely
before they make their way into the index. I suspect it is an issue
with collation and regexes in perl and my environment, but I didn't
figure this out yet.
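
One way such stripping can arise, shown here purely as a hypothetical: a tokenizer whose character class covers only ASCII letters. The Python sketch below is not taken from the EPrints indexer (which is Perl) and makes no claim about its actual regexes; it only shows how an ASCII-only pattern loses accented characters while a Unicode-aware one keeps the word whole.

```python
import re
import unicodedata

text = unicodedata.normalize("NFC", "čefelín")

# ASCII-only character class: accented letters act as separators.
ascii_tokens = re.findall(r"[a-z]+", text)

# Unicode-aware word class (the default for str patterns in Python 3).
unicode_tokens = re.findall(r"\w+", text)

print(ascii_tokens)
print(unicode_tokens)
```

If the indexer showed the first behaviour, the index would contain fragments rather than the full name.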

That was how I found out that EPrints is internally working with
unicode, but its database is not working with unicode at all.

>
>  Internationalisation and localisation are handled internally by EPrints.
> Strictly the database is being asked to store data as binary, rather than
> "latin-1".

right, sorry - but then no collation works and you lose other
things like functions; not that upper(name) is that important, but
the collation stuff is

>
>  I suspect indexing is always going to be EPrints-specific, because you will
> want to expand something like:
>  Völker to {Völker,Volker,Voelker}

I agree, it is great to have it. But it would be better to have it
together with Unicode. I can think of many languages, and there is no
way for a mere mortal to prepare mappings for all of them. Maybe I got
some configuration wrong, it is a new server here, but still that
should not be an issue if the system supports utf8. Are there any
installations of EPrints in Russian, Arabic, Chinese? Does
searching work for them? I doubt it.

>
>  At the moment the ordervalues_* tables are used by searches. You could
> change their character set to utf-8 and the collation to the appropriate
> language-specific collation. But the ordering on views is handled internally
> by EPrints.

I have changed the whole database and converted the tables, I issue
"set names utf8" when the connection begins, and I can query the
database using standard tools; I can write to it from external
applications too. I can't think of any reason why it should not work like
this. I only need to fix the indexer now; it is dropping accented
characters (but that is an issue with a Perl script). I have to find
it.

>
>  Doing something a bit smarter using the database collations may be possible
> with 3.1.

Is it there already? I'd better upgrade

Cheers,

  roman


