EPrints Technical Mailing List Archive

Message: #07197



Re: [EP-tech] A specific eprint doesn't get indexed


We've dealt with this over the years, too.  Some pointers, which might
be difficult depending on your situation:

1. Make sure the relevant columns/tables/database use a Unicode
encoding (currently the 3.3 branch is set up for 'utf8', but I've
migrated ours to 'utf8mb4') -- this involves both:

   a) making sure the EPrints code uses the right encoding parameters
in all its database queries (not just EPrints::Database and
EPrints::Database::mysql, but also any other library or package that
handles its own database connections), and

   b) ensuring that any existing database tables are converted
correctly (see:
https://dev.mysql.com/doc/refman/5.7/en/alter-table.html#alter-table-character-set
)

2. Make sure the connection to the database uses a Unicode encoding;
for example:

   * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database/mysql.pm#L242

   * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database.pm#L164

   * https://dev.mysql.com/doc/refman/5.7/en/mysql-command-options.html#option_mysql_default-character-set

3. Make sure EPrints/Perl handles Unicode strings correctly and
consistently.  It's a bit of a pain, but we're working at it!
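
As a rough illustration of why 'utf8' versus 'utf8mb4' matters in
point 1: MySQL's 'utf8' stores at most three bytes per character, so
any character above U+FFFF (four bytes in UTF-8) needs 'utf8mb4'.  The
Python sketch below is only an illustration; the helper names are
hypothetical and not part of EPrints, and the collation chosen is an
assumption.  It flags strings that need 'utf8mb4' and builds the kind
of ALTER TABLE statement described at the MySQL link in 1(b).

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character lies above U+FFFF.

    Such characters take four bytes in UTF-8 and do not fit in
    MySQL's three-byte 'utf8' character set.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


def max_utf8_bytes(text: str) -> int:
    """Return the largest per-character UTF-8 byte length in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def convert_statement(table: str,
                      charset: str = "utf8mb4",
                      collation: str = "utf8mb4_unicode_ci") -> str:
    """Build a MySQL statement converting one table's character set.

    Hypothetical helper: it only builds the SQL text and does not
    run it.  The collation default is an assumption.
    """
    return (f"ALTER TABLE `{table}` "
            f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};")


# U+1D11E (musical G clef) is outside the Basic Multilingual Plane.
samples = ["naive", "na\u00efve", "\u20ac", "\U0001D11E"]
report = [(max_utf8_bytes(s), needs_utf8mb4(s)) for s in samples]
statement = convert_statement("eprint__rindex")
```

Anything the statement builder produces should be tried on a backup
first, since converting a live table rewrites its data.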

Cheers


On 3 March 2018 at 10:53, David R Newman <drn@ecs.soton.ac.uk> wrote:
> Hi Avi,
>
> I have noticed this issue happening quite a lot as well.  I have tracked it
> down to a problem indexing PDF documents where the extracted word to be
> indexed contains non-ASCII characters.  If the whole word is non-ASCII
> characters, the empty string effectively gets indexed.  If there is more
> than one word that is all non-ASCII characters, indexing fails with the
> error you see below, as it cannot index the empty string twice for the same
> EPrint and field (i.e. documents).  This is because the eprint__rindex table
> has three fields that make up its primary key: field, word and eprintid.
> As the middle one is not set, you see documents--91 rather than something
> like documents-word-91 in your error message.
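
The failure mode described in the paragraph above can be reproduced in
miniature.  The Python sketch below is a toy model under stated
assumptions, not EPrints' actual indexing code: it uses an in-memory
SQLite table named rindex with the same three-column primary key, and
a hypothetical ascii_only() step standing in for whatever strips the
non-ASCII characters.  The second all-non-ASCII word collides with the
first because both have been reduced to the empty string.

```python
import sqlite3


def ascii_only(word: str) -> str:
    """Hypothetical lossy step: silently drop non-ASCII characters."""
    return word.encode("ascii", errors="ignore").decode("ascii")


def index_words(conn, field, eprintid, words):
    """Insert one row per word; return the words whose insert failed."""
    failures = []
    for word in words:
        try:
            conn.execute(
                "INSERT INTO rindex (field, word, eprintid) VALUES (?, ?, ?)",
                (field, ascii_only(word), eprintid),
            )
        except sqlite3.IntegrityError:
            # Same (field, word, eprintid) triple already present.
            failures.append(word)
    return failures


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rindex ("
    " field TEXT NOT NULL,"
    " word TEXT NOT NULL,"
    " eprintid INTEGER NOT NULL,"
    " PRIMARY KEY (field, word, eprintid))"
)

# Two Hebrew words written as escapes; both reduce to '' after stripping.
first = "\u05e9\u05dc\u05d5\u05dd"
second = "\u05e2\u05d5\u05dc\u05dd"
failed = index_words(conn, "documents", 91, ["thesis", first, second])
rows = conn.execute("SELECT word FROM rindex ORDER BY word").fetchall()
```

In this model only the second stripped word is rejected; the ordinary
word and one empty-string row are still stored.
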
>
> As far as I can tell, this just stops the one badly encoded word from
> being indexed rather than preventing all indexing for the whole EPrint.  I
> have tested this by writing a script to completely de-index an EPrint and
> then running reindex.  I could see the records disappear from the
> eprint__rindex table and then reappear after the reindex.
>
> I am going to see if I can get the encoding issue sorted out, as this is
> likely to be problematic for people who are indexing publications with
> non-Latin alphabets.  However, this is never straightforward, based on past
> experience.
>
> Regards
>
> David Newman
>
>
> On 02/03/2018 10:53, Stenger, Avischai wrote:
>
>
> Hello to all,
>
> I have some eprints that do not get reindexed. If I execute, for example:
>
> ~/bin/epadmin reindex REPO eprint 91
>
> I get the error:
>
> DBD::mysql::st execute failed: Duplicate entry 'documents--91' for key
> 'PRIMARY' at /usr/share/eprints/bin/../perl_lib/EPrints/Database.pm line
> 1287.
>
>
>
> I noticed that if I replace the PDF document in this eprint, I can index
> it without any error message.
>
> If I check the PDF with an online PDF checker, it says the PDF is okay.
> (https://www.pdf-online.com/osa/validate.aspx)
>
>
> Thanks, and have a good weekend
>
>
> Avi
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/



-- 
  Matthew Kerwin
  https://matthew.kerwin.net.au/