[EP-tech] SQL Problem at EPrints

CAUTION: This e-mail originated outside the University of Southampton.
Hi David,

Thank you for your answer. I will try it first and let you know the result.

Agung Prasetyo W.

On Mon, Apr 17, 2023, 14:47 David R Newman <drn at ecs.soton.ac.uk> wrote:
Hi Agung PW,

I think this may be similar to the issue that Mario reported recently.  The database cannot index certain words that appear in the generated indexcodes files, which exist so that the full text of documents can be indexed.

I previously proposed two solutions.  Option 1 below is a stopgap that fixes the issue while you are on the current version of EPrints, but it means certain words will not be indexed.  Option 2 is the solution I have implemented for future versions of EPrints, which avoids leaving those words unindexed:

1. Add the following to your archive's cfg/cfg.d/indexing.pl (if this does not exist, copy it into place from lib/cfg.d/indexing.pl).

        if( $word =~ m/[^\x20-\xEF]/ )

Add it after the existing block of code that begins:

        if( $word =~ m/^[A-Z][A-Z0-9]+$/ )

The words that this will stop being indexed are unlikely to be words that would be searched for, as this code should only affect extended characters.  The work I did on Mario's issue found these words were mostly Latin or Greek characters set in a particular font because they were part of mathematical equations.  One example is: ???.

2. Look at https://github.com/eprints/eprints3.4/issues/320 and merge the commit it contains.  This should add mappings for the indexer, so these words can now be indexed.  However, for full text indexing, this only takes effect when the indexcodes files are regenerated.  epadmin has a command to regenerate all of these and reindex, but that could take a very long time with a large repository.  Therefore, I have improved the indexer so that the --force flag on "epadmin reindex" will force the indexcodes files to be regenerated and make use of the new mappings (see https://github.com/eprints/eprints3.4/issues/321).  If you do not want to use the new version of epadmin, using the "Reindex Item" button in the web interface should achieve the same thing.  If you see my earlier emails to Mario on the EPrints Tech list, you will see I was a little baffled why indexcodes files were only regenerated this way and not currently when using epadmin.

Anyway, with either solution, make sure that the indexer is restarted to apply the changes made.  (If you intend to use the "Reindex Item" button I would also reload the webserver just to be sure.)  Not restarting will not affect your initial use of "epadmin reindex" for specific eprints you want to test/fix, but it will prevent the changes being applied to future indexing tasks carried out by the indexer.
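
As an illustration of which words are affected, the byte sequences in the "Incorrect string value" errors quoted further down this thread can be decoded to see what characters they represent.  The sketch below is in Python rather than EPrints' Perl and is not part of EPrints.  The names REPORTED, BYTE_FILTER, describe and would_be_skipped are hypothetical, and it assumes the stopgap pattern is matched against UTF-8 encoded bytes, which this thread does not confirm.  Every reported sequence decodes to a code point above U+FFFF, which needs four bytes in UTF-8.  MySQL's legacy utf8 (utf8mb3) character set stores at most three bytes per character, which would be consistent with the error, although the character set of the 'word' column is not stated here.

```python
import re
import unicodedata

# Byte sequences copied from the "Incorrect string value" errors in this thread.
REPORTED = [
    b"\xF0\x9D\x91\x9F",
    b"\xF0\x9D\x91\xA6",
    b"\xF0\x9D\x90\xBF",
    b"\xF0\x9D\x91\xA1",
    b"\xF0\x9D\x91\x9D",
]

# Hypothetical Python analogue of the stopgap pattern, applied to UTF-8 bytes.
# Assumption: EPrints matches against encoded bytes; the thread does not say.
BYTE_FILTER = re.compile(rb"[^\x20-\xEF]")


def describe(raw: bytes) -> str:
    """Decode one UTF-8 sequence; report its code point, name and byte length."""
    ch = raw.decode("utf-8")
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} ({len(raw)} bytes in UTF-8)"


def would_be_skipped(word: str) -> bool:
    """True if the UTF-8 encoding of word has a byte outside 0x20-0xEF."""
    return BYTE_FILTER.search(word.encode("utf-8")) is not None


if __name__ == "__main__":
    for raw in REPORTED:
        print(describe(raw))
    # Plain ASCII, accented Latin, Greek, and a word starting with U+1D45F.
    for word in ["repository", "caf\u00e9", "\u03b1\u03b2\u03b3", "\U0001D45F13"]:
        print(ascii(word), would_be_skipped(word))
```

Under that byte-level assumption, ordinary accented Latin and Greek letters (two bytes each in UTF-8) pass the check, and only words containing a four-byte character such as the mathematical italic letters above would be skipped.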


David Newman

On 17/04/2023 12:51 am, Agung Prasetyo W. via Eprints-tech wrote:

When I run the command: epadmin reindex *repository_id* *dataset_id* [*eprint_id*]

I get errors like this:
Indexed item: eprint/7039
DBD::mysql::st execute failed: Incorrect string value: '\xF0\x9D\x91\x9F13' for column 'word' at row 1 at /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.
Indexed item: eprint/7040
Indexed item: eprint/7041
DBD::mysql::st execute failed: Incorrect string value: '\xF0\x9D\x91\xA6\xF0\x9D...' for column 'word' at row 1 at /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.
Indexed item: eprint/7042
Indexed item: eprint/7043
DBD::mysql::st execute failed: Incorrect string value: '\xF0\x9D\x90\xBF\xF0\x9D...' for column 'word' at row 1 at /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.
Indexed item: eprint/7044
DBD::mysql::st execute failed: Incorrect string value: '\xF0\x9D\x91\xA1\xF0\x9D...' for column 'word' at row 1 at /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.
Indexed item: eprint/7045
Indexed item: eprint/7046
DBD::mysql::st execute failed: Incorrect string value: '\xF0\x9D\x91\x9D\xF0\x9D...' for column 'word' at row 1 at /usr/share/eprints3/bin/../perl_lib/EPrints/Database.pm line 1287.
Indexed item: eprint/7047

Is there any solution for this problem?

Thank you.

Agung PW

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
