eprints_tech messages
Please note: this page shows emails that have been sent to the eprints_tech mailing list. Some of these may be spam emails we have failed to filter.
[EP-tech] either me or Eprints is missing on utf8 - bug/feature request
From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Sun, 11 May 2008 00:41:59 +0200
*** http://www.eprints.org/tech.php/id/%3Cea0115e90805101541v542d2c1o3ad4f731703bfba9%40mail.gmail.com%3E
*** EPrints community wiki - http://wiki.eprints.org/

Hello,

Excuse my premature senility, but some (serious) things are not clear to me. I have successfully converted my database to utf8, fighting with several issues along the way, and found (possibly) a bug.

Firstly, one cannot set a default collation for the database like this:

    Alter database eprints3 character set utf8 collate utf8_czech_ci;

because this will happen:

    DBD::mysql::st execute failed: Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_czech_ci,IMPLICIT) for operation '=' at /opt/eprints3/perl_lib/EPrints/Database.pm line 2363.
    SQL ERROR (execute): SELECT M.subjectid, M.pos, M.ancestors, C.pos FROM cache5960 AS C, subject_ancestors AS M WHERE M.subjectid = C.subjectid AND C.pos>0 ORDER BY C.pos
    SQL ERROR (execute): Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8_czech_ci,IMPLICIT) for operation '='
    DBD::mysql::st fetchrow_array failed: fetch() without execute() at /opt/eprints3/perl_lib/EPrints/Database.pm line 2073.

I have also converted all my data to utf8 (and I am sure it is correct in the database), but EPrints then starts to complain that the encoding is wrong. I can fix that by adding

    $self->do('SET NAMES utf8');

to Database.pm, where the instance is created, and then everything is fine.

But this should not be necessary (?). Am I missing something? Or are all EPrints archives storing utf8 as latin1 internally in their databases? (And, as somebody reported, proper sorting does not work.) Shall I install a new version of EPrints? Please give me some reasonable answers; it can't be EPrints, it must be me...

Thanks,

roman

Here is the conversion how-to. I will eventually put it in the wiki (it depends on your answers).

# dump the schema of the database
mysqldump --no-data --set-charset -u root -p<password> <db_name> > schema.sql

# dump the data; it will actually be utf8 encoded, don't be fooled by the charset latin1 bit
mysqldump --no-create-info --skip-set-charset -u root -p<yourpassword> --default-character-set=latin1 <db_name> > data.sql

# open schema.sql in an editor and:
- replace all occurrences of CHARSET=latin1 with CHARSET=utf8
- also change the default NULL charset for columns (see http://bugs.mysql.com/bug.php?id=23073)
  -- search for "varchar(255)" and replace it with "varchar(255) CHARACTER SET utf8 "

# set the utf encoding for the data; in linux you can do:
echo 'SET NAMES utf8;' | cat - data.sql > datautf.sql

# now load the edited db schema (this will recreate the database, AND DESTROY ALL THE DATA!!! - make sure you have it in datautf.sql)
mysql <db_name> -u root -p < schema.sql

# load the data
mysql <db_name> -u root -p < datautf.sql

----
Now you are done. If you want to set the default encoding for the database (that is useful only for newly created tables, and it might be better to set the charset globally, for the whole server) you can issue:

alter database <db_name> character set utf8 collate;
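The manual editor step in the how-to (replacing CHARSET=latin1 and adding CHARACTER SET utf8 to the varchar(255) columns) could probably be scripted with sed. The following is only a minimal sketch under assumptions: `convert_schema` is a hypothetical helper name, the sample table is invented for illustration, and a real dump may contain other column types or charsets that need the same treatment.

```shell
#!/bin/sh
# Sketch only: automates the two manual search-and-replace edits on schema.sql.
# convert_schema is a hypothetical helper, not part of EPrints or MySQL.
convert_schema() {
    # Reads a schema dump on stdin and writes the edited dump on stdout.
    # In sed's basic regex syntax, "(" and ")" match literally.
    sed -e 's/CHARSET=latin1/CHARSET=utf8/g' \
        -e 's/varchar(255)/varchar(255) CHARACTER SET utf8/g'
}

# Invented one-column table standing in for a real dump.
sample='CREATE TABLE `eprint` (
  `title` varchar(255) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1;'

printf '%s\n' "$sample" | convert_schema
```

On a real dump this would be run as `convert_schema < schema.sql > schema_utf8.sql` (the output file name is an assumption). The substitution is not idempotent, so it should be applied once to the original dump, and the result is worth checking with diff before loading it.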
Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request
From: Tim Brody <tdb01r AT ecs.soton.ac.uk>
Date: Mon, 12 May 2008 10:26:21 +0100
*** http://www.eprints.org/tech.php/id/%3C48280D3D.5060900%40ecs.soton.ac.uk%3E

EPrints doesn't expect the database to be in Unicode (or any other encoding).

The theory is that if you want a sorting other than in English you will write a custom method for your language and use it in the "make_value_orderkey" property on the fields that aren't in English.

This property is briefly documented at: http://wiki.eprints.org/w/Metadata

All the best,
Tim.
Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request
From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Mon, 12 May 2008 12:06:20 +0200
*** http://www.eprints.org/tech.php/id/%3Cea0115e90805120306kd7ea332ic8531c2d110bf695%40mail.gmail.com%3E

Thank you Tim,

but how can the community live without Unicode? How can they search for Unicode strings? It is very expensive to use our own sorting routines when the database can do it faster and better. I cannot do without Unicode, and I suppose neither can hundreds of thousands of sites out there. Even if we can provide mappings for metadata fields, we cannot deal with all the possible variations coming from the full text - that is a lost fight.

My EPrints installation is running fine with Unicode, but indexing is stripping off Unicode strings (searching works well). I guess I am on my own here to fix it...

Please register this as a serious feature request - storing Unicode strings as latin1 is not the same as having full Unicode support. And it is so easy to switch to Unicode; actually, it will cost nothing compared to the benefits.

Best,

roman
Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request
From: Tim Brody <tdb01r AT ecs.soton.ac.uk>
Date: Mon, 12 May 2008 11:54:47 +0100
*** http://www.eprints.org/tech.php/id/%3C482821F7.1000700%40ecs.soton.ac.uk%3E

What are you trying to do that EPrints doesn't do?

Internationalisation and localisation are handled internally by EPrints. Strictly the database is being asked to store data as binary, rather than "latin-1".

I suspect indexing is always going to be EPrints-specific, because you will want to expand something like:

    Völker to {Völker,Volker,Voelker}

At the moment the ordervalues_* tables are used by searches. You could change their character set to utf-8 and the collation to the appropriate language-specific collation. But the ordering on views is handled internally by EPrints.

Doing something a bit smarter using the database collations may be possible with 3.1.

Cheers,
Tim.
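The kind of expansion Tim mentions (Völker to {Völker,Volker,Voelker}) can be sketched in a few lines of shell. This is an illustration only and not how the EPrints indexer is implemented: `expand_variants` is a hypothetical helper, the mapping covers just the three lowercase German umlauts, and the script is assumed to be saved as UTF-8.

```shell
#!/bin/sh
# Sketch only: expand_variants is a hypothetical helper, not EPrints code.
# The mapping covers only the lowercase umlauts ä, ö and ü.
expand_variants() {
    word=$1
    # 1. the word as written
    printf '%s\n' "$word"
    # 2. umlauts reduced to the base vowel
    printf '%s\n' "$word" | sed -e 's/ä/a/g' -e 's/ö/o/g' -e 's/ü/u/g'
    # 3. umlauts transliterated to vowel + e
    printf '%s\n' "$word" | sed -e 's/ä/ae/g' -e 's/ö/oe/g' -e 's/ü/ue/g'
}

expand_variants 'Völker'
```

A word without umlauts yields three identical lines, so a real indexer would deduplicate the variants (for example with `sort -u`). Every additional language needs its own mapping table, which is the maintenance cost raised in the thread.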
Re: [EP-tech] either me or Eprints is missing on utf8 - bug/feature request
From: "Roman Chyla" <roman.chyla AT gmail.com>
Date: Mon, 12 May 2008 13:52:53 +0200
*** http://www.eprints.org/tech.php/id/%3Cea0115e90805120452t1b21fd5ct1bd9b6ee6d022499%40mail.gmail.com%3E

On Mon, May 12, 2008 at 12:54 PM, Tim Brody <tdb01r AT ecs.soton.ac.uk> wrote:
> What are you trying to do that EPrints doesn't do?

I discovered this because searching does not work for accented characters. Or more precisely, it works only sometimes: e.g. "Čefelín" works but it fails for "čefelín". That is the name of an author, and his name was encoded (binary) but obviously not converted to lower case.

Then, issue 2): some accented characters are stripped off completely before they make their way into the index. I suspect it is an issue with collation and regexes in Perl and my environment, but I haven't figured it out yet.

That was how I found out that EPrints works with Unicode internally, but its database does not work with Unicode at all.

> Internationalisation and localisation are handled internally by EPrints.
> Strictly the database is being asked to store data as binary, rather than
> "latin-1".

Right, sorry - but then no collation works and you lose other things like functions. Not that upper(name) is that important, but the collation stuff is.

> I suspect indexing is always going to be EPrints-specific, because you will
> want to expand something like:
> Völker to {Völker,Volker,Voelker}

I agree, it is great to have it. But it would be better to have it together with Unicode. I can think of many languages, and there is no way for a mere mortal to prepare mappings for all of them.

Maybe I got some configuration wrong, it is a new server here, but that still should not be an issue if the system supports utf. Are there any installations of EPrints in Russian, Arabic or Chinese? Does searching work for them? I doubt it.

> At the moment the ordervalues_* tables are used by searches. You could
> change their character set to utf-8 and the collation to the appropriate
> language-specific collation. But the ordering on views is handled internally
> by EPrints.

I have changed the whole database and converted the tables, I issue "set names utf8" when the connection begins, and I can query the database using standard tools; I can write to it from external applications too. I can't think of any reason why it should not work like this. I only need to fix the indexer now: it is dropping accented characters (but that is an issue with the Perl script). I have to find it.

> Doing something a bit smarter using the database collations may be possible
> with 3.1.

Is it there already? I'd better upgrade.

Cheers,

roman
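The "Čefelín" / "čefelín" mismatch is consistent with case folding being applied to raw bytes instead of decoded characters. The sketch below only illustrates that general effect in shell; it is not taken from the EPrints indexer, and `lower_bytes` is a hypothetical helper. It assumes the script is saved as UTF-8.

```shell
#!/bin/sh
# Sketch only: lower_bytes is a hypothetical helper that lowercases byte by
# byte in the C locale, as code might if it treated a value as raw binary.
lower_bytes() {
    printf '%s' "$1" | LC_ALL=C tr '[:upper:]' '[:lower:]'
}

# Compare an ASCII-only name with one that starts with a multi-byte letter.
for name in 'Cefelin' 'Čefelín'; do
    printf '%s -> %s\n' "$name" "$(lower_bytes "$name")"
done
```

In the C locale only the ASCII letters A-Z are folded, so the two bytes that encode "Č" pass through unchanged and the result no longer matches a query typed as "čefelín". Whether this is what happens inside EPrints would need to be confirmed against the indexing code.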





