[EP-tech] indexing full text document pdf with tag get empty word in eprint__rindex table
Hi Mario,
I think I may have a solution to your problem. What you need to do is
edit your archive's cfg/cfg.d/indexing.pl and add the following block of
code:
        if( $word =~ m/[^\x20-\xEF]/ )
        {
            $ok=0;
        }
Put this after the existing block of code that looks like this:
        if( $word =~ m/^[A-Z][A-Z0-9]+$/ )
        {
            $ok=1;
        }
If you do not have a cfg/cfg.d/indexing.pl in your archive, copy the one
in lib/cfg.d/ to your archive and then edit this copy. Once you have
finished editing, make sure you restart the indexer, and it is probably
best to reload the webserver as well. Oddly, if you try to run "epadmin
reindex" for this eprint from the command line it won't work, as it will
reuse the existing indexcodes.txt file that has the offending strings.
However, if you click "Reindex Item" under the "Actions" tab of the
eprint item, it should regenerate the indexcodes.txt file. I am going to
look into whether this can be added as a feature to epadmin. At present
epadmin has the option "erase_fulltext_index", but this will cause all
your indexcodes.txt files to be regenerated and all eprint items to be
reindexed.
The aim of this code is to stop strings like '???' from being indexed.
These fail because MySQL ignores certain characters and then tries to
insert an empty string into the eprint__rindex table when there is
already an entry for the empty string, probably because a previous
string with non-standard characters was saved as an empty string. The
way the SQL queries are written to insert multiple entries at once into
the eprint__rindex table means that if one entry fails, the remaining
entries are never attempted.
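
To illustrate why one failing entry loses the rest, here is a minimal
simulation. It does not use EPrints or MySQL: the hash stands in for the
eprint__rindex table, insert_row is a hypothetical helper, and the key
layout (field, word, eprintid) is only an assumption based on the
'documents--75' in the error message. It compares inserting all words as
one unit with guarding each insert separately.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an insert into eprint__rindex. The hash acts as
# the table; the key layout is assumed from the error 'documents--75'.
sub insert_row
{
    my( $table, $eprintid, $field, $word ) = @_;
    my $key = join( "-", $field, $word, $eprintid );
    die "Duplicate entry '$key' for key 'PRIMARY'\n" if exists $table->{$key};
    $table->{$key} = 1;
    return;
}

# Words as the database ends up seeing them: two have become empty strings.
my @words = ( "alpha", "", "beta", "", "gamma" );

# 1. All rows handled as one unit: the first failure abandons the rest.
my %batch;
eval { insert_row( \%batch, 75, "documents", $_ ) for @words; 1 }
    or print "batch stopped: $@";

# 2. Each row guarded on its own: a failure only skips that row.
my %guarded;
foreach my $word ( @words )
{
    eval { insert_row( \%guarded, 75, "documents", $word ); 1 }
        or print "row skipped: $@";
}

printf "batch stored %d rows, guarded stored %d rows\n",
    scalar( keys %batch ), scalar( keys %guarded );
```

In the first case "gamma" is never stored because the second empty
string aborts the loop, which matches the symptom of only some words
being indexed. The second case is just one possible way of letting the
remaining words through.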
With this block of code in place, no strings that would break the
inserts into the eprint__rindex table should make it into
indexcodes.txt. However, I am not yet confident this is the best regular
expression to use, as it may disqualify perfectly legitimate strings.
That said, it looks like all printable ASCII characters are allowed, as
well as various non-English characters that may be used in both Latin
alphabet languages (French, German, Spanish, Italian, etc.) and
non-Latin alphabet languages (Chinese, Japanese, Korean and
Cyrillic-based). I am not sure whether some of the non-Latin character
strings will ultimately get indexed, but they will be added to the
indexcodes.txt files, which is all that the block of code should
directly affect.
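
A quick way to see what the regular expression keeps and rejects is to
run it against a few sample words outside EPrints. The sketch below is
standalone: word_is_rejected is a hypothetical wrapper around the same
test, and the sample words are made up. It checks each word twice, once
as UTF-8 encoded bytes and once as a decoded character string, because
which form the words take at this point in indexing.pl is an assumption
that is not confirmed here.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Same test as the block added to indexing.pl: reject a word containing
# anything outside the range \x20-\xEF.
sub word_is_rejected
{
    my( $word ) = @_;
    return $word =~ m/[^\x20-\xEF]/ ? 1 : 0;
}

sub verdict { return word_is_rejected( $_[0] ) ? "rejected" : "kept" }

# Made-up sample words, written with escapes so this file stays ASCII.
my @samples = (
    [ "ASCII",             "index" ],
    [ "accented Latin",    "caf\x{e9}" ],
    [ "Cyrillic",          "\x{43c}\x{438}\x{440}" ],
    [ "CJK",               "\x{4e2d}\x{6587}" ],
    [ "4-byte character",  "\x{1d400}" ],    # MATHEMATICAL BOLD CAPITAL A
    [ "control character", "tab\there" ],
);

foreach my $sample ( @samples )
{
    my( $label, $chars ) = @$sample;
    my $as_chars = verdict( $chars );
    my $as_bytes = verdict( encode( "UTF-8", $chars ) );
    printf "%-18s as bytes: %-9s as characters: %s\n",
        $label, $as_bytes, $as_chars;
}
```

On UTF-8 bytes the test only rejects control characters and bytes 0xF0
and above, so in practice characters that need four bytes in UTF-8,
while Cyrillic and CJK words are kept. On decoded character strings,
\xEF means the code point U+00EF, so anything above it (including
Cyrillic and CJK) would be rejected, which is worth checking before
relying on the expression.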
Regards
David Newman
On 05/04/2023 11:16 pm, David R Newman wrote:
> Hi Mario,
>
> I managed to find your document with a bit of detective work. I found
> the query causing the duplicate error line was:
>
> INSERT INTO `eprint__rindex` (`eprintid`,`field`,`word`) VALUES
> ('75','documents','???')
>
> The '???' represents special characters that are not recognised by
> MySQL, so MySQL treats this as an empty string, which it cannot
> insert as there is already an empty string. The first insert of any
> empty string was this:
>
> INSERT INTO `eprint__rindex` (`eprintid`,`field`,`word`) VALUES
> ('75','documents','')
>
> It looks like virtually every document tries to insert an
> eprint__rindex record for empty string. So there are two problems to
> look at:
>
> 1. Why does EPrints bother trying to index an empty string?
> 2. Why can't MySQL interpret special characters and treat them as missing?
>
> I suspect the latter is down to character encoding. If that is the
> case, EPrints could try to check whether the special characters will
> not be writeable to the database. However, I am not sure how easy it
> would be to do this and whether it would add significant effort to the
> indexing process. The first thing to do would be to make sure that
> any SQL commands that lead to an error do not stop any further
> indexing of terms. I think that fix should be a lot more
> straightforward. I have created a general GitHub issue for
> investigating this:
>
> https://github.com/eprints/eprints3.4/issues/320
>
> Regards
>
> David Newman
>
> On 05/04/2023 9:26 pm, David R Newman via Eprints-tech wrote:
>> Hi Mario,
>>
>> I have never seen empty words indexed but I have seen the Duplicate
>> entry error message you reported. If you would be happy to share the
>> PDF for your eprint 75 with me (drn at ecs.soton.ac.uk), that would be
>> really useful as that would give a reliable test case. I will only
>> upload it to my EPrints development repository, which is not publicly
>> accessible.
>>
>> My suspicion is that the word may not initially be empty but contains
>> a character that cannot be stored in the database, so when the
>> eprint__rindex record is inserted the word becomes empty. Words to
>> be indexed from a PDF are extracted by the Unix tool pdftotext. You
>> could try running pdftotext against your PDF and see if the output
>> gives you any clues as to why an empty word is being indexed. Index
>> codes may contain words that are not indexed for a number of
>> reasons. They may be stop words, the word may be too short (I know
>> in some cases two-character words are not indexed) or, as I said
>> before, they have special characters that cause an issue when trying
>> to add to the eprint__rindex database table. However, you may be
>> right that something fails and then prevents all further indexing of
>> that document.
>>
>> One thing you can do is temporarily run a MySQL query log for all
>> queries (Adding config to /etc/my.cnf.d temporarily and restarting
>> MySQL) and then try running epadmin reindex.? There will be quite a
>> lot of output but it may tell you something useful.
>>
>> Regards
>>
>> David Newman
>>
>> On 05/04/2023 8:45 pm, Beaudoin, Mario via Eprints-tech wrote:
>>>
>>> Hello,
>>>
>>> We use EPrints 3.4.3 with 14 repositories, each holding a lot of
>>> PDFs, and we have an indexing bug with tagged PDFs.
>>>
>>> All the eprint__rindex tables with these PDFs get an empty "word".
>>> Only some of the documents' words are indexed, not all.
>>>
>>> This SQL command gives me all the affected PDFs:
>>>
>>> select * from eprint__rindex where word='';
>>>
>>> Reindexing these eprints produces an error:
>>>
>>> ./epadmin reindex eprints_fra1 --verbose eprint 75;
>>>
>>> DBD::mysql::st execute failed: Duplicate entry 'documents--75' for
>>> key 'PRIMARY' at /opt/eprints3/bin/../perl_lib/EPrints/Database.pm
>>> line 1289.
>>>
>>> I tried to modify the file indexing.pl, but it already includes a
>>> bypass for empty words.
>>>
>>> I checked the indexcodes.txt for this document and it is complete,
>>> with a lot of words that are not included in the eprint__rindex
>>> database table.
>>>
>>> I tried another PDF document for these eprints that is not tagged,
>>> and it reindexed with no error.
>>>
>>> I think that epadmin reindex gets an empty word and stops indexing
>>> as soon as it gets another empty word, because it indexes some words
>>> but not all. Running the command with --verbose twice gives this:
>>>
>>> [eprints_fra1] Database execute debug: SELECT
>>> `eprintid`,`pos`,`projects` FROM `eprint_projects` WHERE `eprintid`
>>> IN (75)
>>>
>>> Database execute debug: SELECT `eprintid`,`pos`,`skill_areas` FROM
>>> `eprint_skill_areas` WHERE `eprintid` IN (75)
>>>
>>> [eprints_fra1] Database execute debug: SELECT
>>> `eprintid`,`pos`,`skill_areas` FROM `eprint_skill_areas` WHERE
>>> `eprintid` IN (75)
>>>
>>> Database execute debug: INSERT INTO `eprint__rindex`
>>> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>>>
>>> [eprints_fra1] Database execute debug: INSERT INTO `eprint__rindex`
>>> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>>>
>>> DBD::mysql::st execute failed: Duplicate entry 'documents--75' for
>>> key 'PRIMARY' at /opt/eprints3/bin/../perl_lib/EPrints/Database.pm
>>> line 1289.
>>>
>>> Database execute debug: INSERT INTO `eprint__rindex`
>>> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>>>
>>> [eprints_fra1] Database execute debug: INSERT INTO `eprint__rindex`
>>> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>>>
>>> Database execute debug: INSERT INTO `eprint__index_grep`
>>> (`eprintid`,`fieldname`,`grepstring`) VALUES (?,?,?)
>>>
>>> [eprints_fra1] Database execute debug: INSERT INTO
>>> `eprint__index_grep` (`eprintid`,`fieldname`,`grepstring`) VALUES
>>> (?,?,?)
>>>
>>> Database execute debug: INSERT INTO `eprint__rindex`
>>> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>>>
>>> I noticed that if we convert the affected .pdf to .PS and back to
>>> .pdf it works fine, but that is not a solution for us.
>>>
>>> Thank you for your help
>>>
>>> Mario
>>>
>>>
>>> *** Options:http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>>> *** Archive: http://www.eprints.org/tech.php/
>>> *** EPrints community wiki: http://wiki.eprints.org/
>>
>>
>