[EP-tech] indexing full text document pdf with tag get empty word in eprint__rindex table



Hi Mario,

I have never seen empty words indexed, but I have seen the Duplicate
entry error message you reported. If you would be happy to share the
PDF for your eprint 75 with me (drn at ecs.soton.ac.uk), that would be
really useful as that would give a reliable test case. I will only
upload it to my EPrints development repository, which is not publicly
accessible.

My suspicion is that the word may not initially be empty but contains a
character that cannot be stored in the database, so when the
eprint__rindex record is inserted the word becomes empty. Words to be
indexed from a PDF are extracted by the Unix tool pdftotext. You could
try running pdftotext against your PDF and see if the output gives you
any clues as to why an empty word is being indexed. Index codes may
contain words that are not indexed for a number of reasons. They may be
stop words, the word may be too short (I know in some cases
two-character words are not indexed) or, as I said before, they may
have special characters that cause an issue when trying to add to the
eprint__rindex database table. However, you may be right that something
fails and then prevents all further indexing of that document.
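
One way to follow up on the pdftotext suggestion is to scan its output
for characters that often cause storage problems. The Python sketch
below is only that, a sketch. It assumes (this is not confirmed
anywhere in this thread) that the troublesome characters are either
outside the Basic Multilingual Plane, which a 3-byte MySQL "utf8"
(utf8mb3) column cannot store, or are control, format, private-use or
unassigned characters. The function names and the whitespace-based word
splitting are illustrative and are not how EPrints itself tokenises
text.

```python
import subprocess
import sys
import unicodedata


def suspicious_chars(word):
    """Return the characters in `word` that might not survive a database insert."""
    found = []
    for ch in word:
        if ord(ch) > 0xFFFF:
            # Needs 4 bytes in UTF-8; MySQL's 3-byte "utf8" (utf8mb3) cannot store it.
            found.append(ch)
        elif unicodedata.category(ch) in ("Cc", "Cf", "Co", "Cn"):
            # Cc = control, Cf = format, Co = private use, Cn = unassigned.
            found.append(ch)
    return found


def scan_text(text):
    """Return (word, [code points]) for each whitespace-separated word with suspicious characters."""
    report = []
    for word in text.split():
        bad = suspicious_chars(word)
        if bad:
            report.append((word, ["U+%04X" % ord(ch) for ch in bad]))
    return report


def extract_text(pdf_path):
    """Run pdftotext on `pdf_path`; the trailing '-' sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Usage: python scan_pdf_words.py /path/to/document.pdf  (script name is hypothetical)
if __name__ == "__main__" and len(sys.argv) == 2:
    for word, codes in scan_text(extract_text(sys.argv[1])):
        print(repr(word), " ".join(codes))
```

Any word it prints is a candidate for the one that ends up empty in
eprint__rindex, but an empty report does not rule out other causes.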

One thing you can do is temporarily enable a MySQL query log for all
queries (by adding config under /etc/my.cnf.d and restarting MySQL)
and then try running epadmin reindex. There will be quite a lot of
output but it may tell you something useful.
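
As a rough illustration of the query-log suggestion, a temporary
drop-in file along these lines should turn on the general query log,
assuming a MySQL or MariaDB server that reads /etc/my.cnf.d. The file
name and log path are placeholders, not values from this thread, and
the log location must be writable by the database user.

```ini
# /etc/my.cnf.d/zz-general-log.cnf  (hypothetical file name)
[mysqld]
# Log every statement the server receives; expect a large file.
general_log = 1
# Placeholder path; choose a location the mysql user can write to.
general_log_file = /var/log/mysql/general.log
```

Remove the file and restart MySQL again once the reindex has been
captured, as the general log grows quickly on a busy server.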

Regards

David Newman

On 05/04/2023 8:45 pm, Beaudoin, Mario via Eprints-tech wrote:
> Hello,
>
> We use EPrints 3.4.3 with 14 repositories, each holding a lot of PDFs,
> and we have an indexing bug with tagged PDFs.
>
> All the eprint__rindex tables for repositories with these PDFs contain
> an empty "word". Only some of each document's words are indexed, not all.
>
> This SQL command gives me all the affected PDFs:
>
> select * from eprint__rindex where word='';
>
> Reindexing one of these eprints produces an error:
>
> ./epadmin reindex eprints_fra1 --verbose eprint 75;
>
> DBD::mysql::st execute failed: Duplicate entry 'documents--75' for key 
> 'PRIMARY' at /opt/eprints3/bin/../perl_lib/EPrints/Database.pm line 1289.
>
> I tried to modify the file indexing.pl, but it already includes a
> bypass for empty words.
>
> I checked the indexcodes.txt for this document and it is complete, with
> a lot of words that are not included in the eprint__rindex database table.
>
> I downloaded another PDF document for these eprints, one that is not
> tagged, and reindexing the document gave no error.
>
> I think that epadmin reindex gets an empty word and stops indexing as
> soon as it gets another empty word, because it indexes some words but
> not all. Running the command with double --verbose gives this:
>
> [eprints_fra1] Database execute debug: SELECT 
> `eprintid`,`pos`,`projects` FROM `eprint_projects` WHERE `eprintid` IN 
> (75)
>
> Database execute debug: SELECT `eprintid`,`pos`,`skill_areas` FROM 
> `eprint_skill_areas` WHERE `eprintid` IN (75)
>
> [eprints_fra1] Database execute debug: SELECT 
> `eprintid`,`pos`,`skill_areas` FROM `eprint_skill_areas` WHERE 
> `eprintid` IN (75)
>
> Database execute debug: INSERT INTO `eprint__rindex` 
> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>
> [eprints_fra1] Database execute debug: INSERT INTO `eprint__rindex` 
> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>
> DBD::mysql::st execute failed: Duplicate entry 'documents--75' for key 
> 'PRIMARY' at /opt/eprints3/bin/../perl_lib/EPrints/Database.pm line 1289.
>
> Database execute debug: INSERT INTO `eprint__rindex` 
> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>
> [eprints_fra1] Database execute debug: INSERT INTO `eprint__rindex` 
> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>
> Database execute debug: INSERT INTO `eprint__index_grep` 
> (`eprintid`,`fieldname`,`grepstring`) VALUES (?,?,?)
>
> [eprints_fra1] Database execute debug: INSERT INTO 
> `eprint__index_grep` (`eprintid`,`fieldname`,`grepstring`) VALUES (?,?,?)
>
> Database execute debug: INSERT INTO `eprint__rindex` 
> (`eprintid`,`field`,`word`) VALUES (?,?,?)
>
> I notice that if we convert the affected PDF to PostScript and then back
> to PDF it works fine, but that is not a solution for us.
>
> Thank you for your help
>
> Mario