
[EP-tech] SQL Problem at EPrints

Hi Agung PW,

I think this may be similar to the issue that Mario reported recently.
The database cannot store certain words found in the indexcodes files,
which are generated so that the full text of documents can be indexed.
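
As a rough illustration of what these words contain, the sketch below
decodes the first byte sequence from the error quoted at the end of this
message. It assumes those bytes are UTF-8. Each such character takes four
bytes in UTF-8; if the "word" column uses MySQL's three-byte utf8
character set (an assumption, the thread does not say), that would
explain the "Incorrect string value" error.

```perl
use strict;
use warnings;
use Encode qw( decode );
use charnames ();

# Bytes copied from the first "Incorrect string value" error in the log
# (the trailing "13" is left out, as it is plain ASCII).
my $bytes = "\xF0\x9D\x91\x9F";

# Assumption: the indexcodes text is UTF-8 encoded.
my $char = decode( "UTF-8", $bytes );
my $cp   = ord( $char );

# Show the code point, its Unicode name, and how many bytes it needs.
printf( "U+%04X %s (%d bytes in UTF-8)\n",
    $cp, charnames::viacode( $cp ) // "unknown", length( $bytes ) );
```

The code point is above U+FFFF, in the range Unicode uses for
mathematical letter variants, which fits the observation below that
these words come from equations.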

Previously, I proposed two solutions. Option 1 below is a stopgap that
fixes the issue whilst you are on the current version of EPrints, but it
means certain words will not be indexed. Option 2 is my implemented
solution for future versions of EPrints, which avoids those words being
left out of the index:

1. Add the following to your archive's cfg/cfg.d/indexing.pl (if this 
does not exist, copy into place from lib/cfg.d/indexing.pl).

        if( $word =~ m/[^\x20-\xEF]/ )
        {
            $ok=0;
        }

Add this after the following existing block of code:

        if( $word =~ m/^[A-Z][A-Z0-9]+$/ )
        {
            $ok=1;
        }

The words that this will stop being indexed are unlikely to be words
that would be searched for, as this code should only affect extended
characters. The work I did on Mario's issue found these words were
mostly Latin or Greek characters using a particular font as they were
part of mathematical equations. One example is: ???.

2. Look at https://github.com/eprints/eprints3.4/issues/320 and merge
the commit it contains. This should add mappings for the indexer, so
these words can now be indexed. However, for full text indexing, this
occurs when the indexcodes files are regenerated. epadmin has a command
to regenerate all of these and reindex, but that could take a very long
time with a large repository. Therefore, I have improved the indexer so
that the --force flag on "epadmin reindex" will force the indexcodes
files to be regenerated and make use of these new mappings (see
https://github.com/eprints/eprints3.4/issues/321 if you do not want to
use the new version of epadmin). Using the "Reindex Item" button in the
web interface should achieve the same thing. If you see my earlier
emails to Mario on the EPrints Tech list, you will see I was a little
baffled why indexcodes files were only regenerated this way and not
currently when using epadmin.
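
To see how the stopgap in option 1 behaves, here is a minimal standalone
sketch that applies the two checks in the order described. The helper
word_ok and the minimum length of 3 are hypothetical stand-ins for
illustration only; the real logic lives inside indexing.pl and may
differ in detail.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the archive's minimum word length setting.
my $MIN_WORD_SIZE = 3;

# Hypothetical helper: returns 1 if the word would be indexed, else 0.
sub word_ok
{
    my( $word ) = @_;

    # Start from the length test.
    my $ok = length( $word ) >= $MIN_WORD_SIZE ? 1 : 0;

    # Existing rule from the thread: all-capital words count as acronyms.
    if( $word =~ m/^[A-Z][A-Z0-9]+$/ )
    {
        $ok = 1;
    }

    # Stopgap from option 1: skip words containing anything
    # outside the range \x20-\xEF.
    if( $word =~ m/[^\x20-\xEF]/ )
    {
        $ok = 0;
    }

    return $ok;
}

# The first word is the byte string quoted in the first error in the log.
foreach my $word ( "\xF0\x9D\x91\x9F13", "repository", "MI5", "of" )
{
    # Escape non-printable bytes so the output stays readable.
    ( my $shown = $word ) =~ s/([^\x20-\x7E])/sprintf( "\\x%02X", ord( $1 ) )/ge;
    printf( "%-24s %s\n", $shown, word_ok( $word ) ? "indexed" : "skipped" );
}
```

Because the stopgap check runs last, it overrides both the length test
and the acronym rule, so a word containing one of the problem bytes is
skipped whatever else it contains.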

Anyway, with either solution, make sure that the indexer is restarted to
apply the changes made. (If you intend to use the "Reindex Item" button,
I would also reload the webserver just to be sure.) Not restarting will
not affect your initial use of "epadmin reindex" for specific eprints you
want to test/fix, but it will prevent the changes being applied to future
indexing tasks carried out by the indexer.


David Newman

On 17/04/2023 12:51 am, Agung Prasetyo W. via Eprints-tech wrote:
> Hi,
> When I running command : epadmin reindex *repository_id* *dataset_id* 
> [*eprint_id*]
> I got an error like this :
> Indexed item: eprint/7039
> *DBD::mysql::st execute failed: Incorrect string value: 
> '\xF0\x9D\x91\x9F13' for column 'word' at row 1 at 
> /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.*
> Indexed item: eprint/7040
> Indexed item: eprint/7041
> *DBD::mysql::st execute failed: Incorrect string value: 
> '\xF0\x9D\x91\xA6\xF0\x9D...' for column 'word' at row 1 at 
> /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.*
> Indexed item: eprint/7042
> Indexed item: eprint/7043
> *DBD::mysql::st execute failed: Incorrect string value: 
> '\xF0\x9D\x90\xBF\xF0\x9D...' for column 'word' at row 1 at 
> /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.*
> Indexed item: eprint/7044
> *DBD::mysql::st execute failed: Incorrect string value: 
> '\xF0\x9D\x91\xA1\xF0\x9D...' for column 'word' at row 1 at 
> /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.*
> Indexed item: eprint/7045
> Indexed item: eprint/7046
> *DBD::mysql::st execute failed: Incorrect string value: 
> '\xF0\x9D\x91\x9D\xF0\x9D...' for column 'word' at row 1 at 
> /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287*.
> Indexed item: eprint/7047
> Is there any solution for this problem?
> Thank you.
> Regards,
> Agung PW
> *** Options:http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/