EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #07198

Re: [EP-tech] A specific eprint doesn't get indexed ,

To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] A specific eprint doesn't get indexed ,
From: David R Newman <drn@ecs.soton.ac.uk>
Date: Sat, 3 Mar 2018 18:49:24 +0000

Hi Matthew,

Thanks for the advice. That seems to work for the issue I observed. To give an example walkthrough for Avi, I did the following:

1. Added the $dsn.= ";mysql_enable_utf8=1"; line just before the return line of the build_connection_string method in perl_lib/EPrints/Database.pm

2. Changed $self->do("SET NAMES 'utf8'"); to $self->do("SET NAMES 'utf8mb4'"); in the connect method of perl_lib/EPrints/Database/mysql.pm

3. Ran the following commands at the MySQL prompt. (I am not sure if the collate lines are needed, but I wanted to keep things consistent):


ALTER TABLE eprint__rindex CONVERT TO CHARACTER SET utf8mb4;

ALTER TABLE eprint__rindex modify column word varchar(128) not null collate 'utf8mb4_bin';

ALTER TABLE eprint__rindex modify column field varchar(64) not null collate 'utf8mb4_bin';

4. Ran my script to de-index the record. This should not be necessary, but it was useful for me to confirm that the index entries are removed before being re-added.


5. Ran epadmin reindex on the appropriate record.

6. Queried the database to make sure the words that previously failed to be indexed succeeded this time.

7. Did an advanced search using the documents field with one of the newly indexed terms that the database query found, to confirm that the EPrint is returned as a result.
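
As background to the utf8 to utf8mb4 change in steps 2 and 3, here is a minimal Python sketch of the difference between the two MySQL character sets: 'utf8' (utf8mb3) stores at most 3 bytes per character, while 'utf8mb4' stores up to 4. The helper names and sample words are made up for illustration and are not EPrints code. The thread does not say whether 4-byte characters were involved in this particular case, so treat this as an explanation of the character sets only.

```python
def max_utf8_bytes(word: str) -> int:
    """Return the largest number of UTF-8 bytes any single character needs."""
    return max((len(ch.encode("utf-8")) for ch in word), default=0)


def fits_mysql_utf8(word: str) -> bool:
    """True if every character fits in MySQL's 3-byte 'utf8' (utf8mb3)."""
    return max_utf8_bytes(word) <= 3


# Hypothetical sample words (written with escapes to avoid encoding surprises):
samples = [
    "index",                     # plain ASCII, 1 byte per character
    "na\u00efve",                # Latin small i with diaeresis, 2 bytes
    "\u043d\u0430\u0443\u043a\u0430",  # Cyrillic word, 2 bytes per character
    "\U0001d6fc-decay",          # mathematical italic alpha, 4 bytes
]

# Maps each word to (widest character in bytes, fits in 3-byte utf8?)
report = {w: (max_utf8_bytes(w), fits_mysql_utf8(w)) for w in samples}

for word, (width, fits) in report.items():
    print(f"{word!r}: widest character is {width} byte(s), fits utf8: {fits}")
```

Only the last sample needs utf8mb4; the others would fit in either character set.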

It is probably worth doing a complete reindex of all EPrint records using epadmin reindex. This will achieve two things: test that the original problem is resolved, and make all EPrints searchable on the terms that were intended to be indexed.


Regards

David Newman


On 03/03/2018 01:45, Matthew Kerwin wrote:

We've dealt with this over the years, too.  Some pointers, which might
be difficult depending on your situation:

1. Make sure the relevant columns/tables/database use a Unicode
encoding (currently the 3.3 branch is set up for 'utf8', but I've
migrated ours to 'utf8mb4') -- this involves both:

    a) making sure the EPrints code uses the right encoding parameters
in all its database queries (not just EPrints::Database and
EPrints::Database::mysql, but also any other library or package that
handles its own database connections), and

    b) ensuring that any existing database tables are converted
correctly (see:
https://dev.mysql.com/doc/refman/5.7/en/alter-table.html#alter-table-character-set
)

2. Make sure the connection to the database uses a Unicode encoding;
for example:

    * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database/mysql.pm#L242

    * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database.pm#L164

    * https://dev.mysql.com/doc/refman/5.7/en/mysql-command-options.html#option_mysql_default-character-set

3. Make sure EPrints/Perl handles Unicode strings correctly and
consistently.  It's a bit of a pain, but we're working at it!
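
To illustrate why the encoding has to match at every layer, here is a small generic Python sketch of what happens when bytes written as UTF-8 are read back under a different character set. It uses Python's own codec names rather than MySQL's, and the function name is hypothetical; it shows the general failure mode, not the behaviour of EPrints or of any particular MySQL connection.

```python
def read_with_wrong_charset(text: str, wrong: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes as if they were `wrong`."""
    return text.encode("utf-8").decode(wrong)


original = "caf\u00e9"  # 'cafe' with an acute accent on the e

# The 2-byte UTF-8 sequence for the accented e is misread as two
# separate Latin-1 characters.
garbled = read_with_wrong_charset(original)

# Reversing the wrong step recovers the text, which shows that the bytes
# were intact and only their interpretation was wrong.
recovered = garbled.encode("latin-1").decode("utf-8")

print(repr(original), "->", repr(garbled), "->", repr(recovered))
```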

Cheers


On 3 March 2018 at 10:53, David R Newman <drn@ecs.soton.ac.uk> wrote:

Hi Avi,

I have seen this issue quite often as well.  I have tracked it
down to a problem indexing PDF documents where the extracted word to be
indexed contains non-ASCII characters.  If the whole word consists of
non-ASCII characters, the empty string is effectively what gets indexed.
If more than one word consists entirely of non-ASCII characters, indexing
fails with the error you see below, because the empty string cannot be
indexed twice for the same EPrint and field (i.e. documents).  This is
because the eprint__rindex table has a primary key made up of three
fields: field, word and eprintid.  As the middle one is empty, you see
documents--91 rather than something like documents-word-91 in your error
message.
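
The collision described above can be reproduced in miniature. The following Python sketch uses an in-memory SQLite table as a stand-in for MySQL, with a simplified table that has only the three key columns named above. The ascii_only helper is a hypothetical imitation of the "empty string gets indexed" effect and is not taken from the EPrints code.

```python
import sqlite3


def ascii_only(word: str) -> str:
    """Hypothetical stand-in: drop every non-ASCII character from a word."""
    return "".join(ch for ch in word if ord(ch) < 128)


def index_words(conn, eprintid, field, words):
    """Insert one row per word; return the words the primary key rejected."""
    rejected = []
    for word in words:
        try:
            conn.execute(
                "INSERT INTO rindex (field, word, eprintid) VALUES (?, ?, ?)",
                (field, ascii_only(word), eprintid),
            )
        except sqlite3.IntegrityError:
            # Same (field, word, eprintid) triple already present.
            rejected.append(word)
    return rejected


conn = sqlite3.connect(":memory:")
# Simplified table: only the three columns that form the primary key.
conn.execute(
    "CREATE TABLE rindex ("
    " field TEXT NOT NULL, word TEXT NOT NULL, eprintid INTEGER NOT NULL,"
    " PRIMARY KEY (field, word, eprintid))"
)

# Two Cyrillic words both collapse to the empty string, so the second
# one collides with the first on ('documents', '', 91).
words = ["thesis", "\u043d\u0430\u0443\u043a\u0430", "\u0441\u043b\u043e\u0432\u043e"]
rejected = index_words(conn, 91, "documents", words)
rows = conn.execute(
    "SELECT field, word, eprintid FROM rindex ORDER BY word"
).fetchall()

print("rejected:", rejected)
print("rows:", rows)
```

The first all-non-ASCII word is stored as an empty word, and the second is rejected by the key, which corresponds to the 'documents--91' duplicate in the error message.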

As far as I can tell, this only prevents the one badly encoded word from
being indexed rather than preventing all indexing for the whole EPrint.  I
have tested this by writing a script to completely de-index an EPrint and
then running reindex.  I could see the records disappear from the
eprint__rindex table and then reappear after the reindex.

I am going to see if I can get the encoding issue sorted out, as this is
likely to be problematic for people who are indexing publications with
non-Latin alphabets.  However, this is never straightforward, based on past
experience.

Regards

David Newman


On 02/03/2018 10:53, Stenger, Avischai wrote:


Hello 2 all,

I have some eprints that do not get reindexed. If I execute, as an example:

~/bin/epadmin reindex REPO eprint 91

I get the error:

DBD::mysql::st execute failed: Duplicate entry 'documents--91' for key
'PRIMARY' at /usr/share/eprints/bin/../perl_lib/EPrints/Database.pm line
1287.



I noticed that if I replace the PDF document in this eprint, I can index
it without any error message.

If I check the PDF with an online PDF checker, it says the PDF is okay.
(https://www.pdf-online.com/osa/validate.aspx)


Thanks and have a good weekend


Avi


*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
*** EPrints developers Forum: http://forum.eprints.org/




References:
- [EP-tech] A specific eprint doesn't get indexed ,
  - From: "Stenger, Avischai" <avischai.stenger@ulb.tu-darmstadt.de>
- Re: [EP-tech] A specific eprint doesn't get indexed ,
  - From: David R Newman <drn@ecs.soton.ac.uk>
- Re: [EP-tech] A specific eprint doesn't get indexed ,
  - From: Matthew Kerwin <matthew@kerwin.net.au>
