[EP-tech] A specific eprint doesn't get indexed



Hi Matthew,

Thanks for the advice.  That seems to work for the issue I observed.  To 
give an example walkthrough for Avi, I did the following:

1. Added the $dsn.= ";mysql_enable_utf8=1"; line just before the return 
line of the build_connection_string method in perl_lib/EPrints/Database.pm

2. Changed $self->do("SET NAMES 'utf8'"); to
$self->do("SET NAMES 'utf8mb4'"); in the connect method of
perl_lib/EPrints/Database/mysql.pm

3. Ran the following commands at the MySQL prompt. (I am not sure if the 
collate lines are needed but wanted to keep things consistent):

ALTER TABLE eprint__rindex CONVERT TO CHARACTER SET utf8mb4;

ALTER TABLE eprint__rindex modify column word varchar(128) not null 
collate 'utf8mb4_bin';

ALTER TABLE eprint__rindex modify column field varchar(64) not null 
collate 'utf8mb4_bin';

4. Ran my script to de-index the record.  This should not be necessary, 
but it was useful for me to confirm that index entries are removed before 
being re-added.

5. Ran epadmin reindex on the appropriate record.

6. Queried the database to make sure words that failed to be indexed 
succeeded this time.

7. Did an advanced search using the documents field with one of these 
newly-indexed terms that the database query found to confirm the EPrint 
is returned as a result.
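
As background to steps 2 and 3: MySQL's 'utf8' character set stores at 
most three bytes per character, while 'utf8mb4' allows four, so 
characters outside the Basic Multilingual Plane only fit in the latter.  
The Python sketch below is not part of EPrints, and the function name 
needs_utf8mb4 is made up for illustration.  It flags words containing 
such characters, which may help when picking terms to check in step 6.

```python
def needs_utf8mb4(word: str) -> bool:
    """Return True if any character in word takes four bytes in UTF-8.

    Such characters cannot be stored in a MySQL 'utf8' (three-byte)
    column but can be stored in a 'utf8mb4' column.
    """
    return any(len(ch.encode("utf-8")) == 4 for ch in word)


if __name__ == "__main__":
    # Hypothetical sample words; the last one starts with U+1D49C,
    # a character outside the Basic Multilingual Plane.
    samples = ["index", "caf\u00e9", "\u0438\u043d\u0434\u0435\u043a\u0441", "\U0001D49Clgebra"]
    for word in samples:
        # !a prints an ASCII-safe representation of each word.
        print(f"{word!a}: {needs_utf8mb4(word)}")
```

Words made up only of one-, two- or three-byte characters (accented 
Latin, Cyrillic, most CJK) are reported as False, so this only picks out 
the cases where the three-byte limit itself is the obstacle.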

It is probably worth doing a complete reindex of all EPrint records 
using epadmin reindex.  This will achieve two things: test that the 
original problem is resolved, and make all EPrints searchable on the 
terms that were intended to be indexed.

Regards

David Newman


On 03/03/2018 01:45, Matthew Kerwin wrote:
> We've dealt with this over the years, too.  Some pointers, which might
> be difficult depending on your situation:
>
> 1. Make sure the relevant columns/tables/database use a Unicode
> encoding (currently the 3.3 branch is set up for 'utf8', but I've
> migrated ours to 'utf8mb4') -- this involves both:
>
>     a) making sure the EPrints code uses the right encoding parameters
> in all its database queries (not just EPrints::Database and
> EPrints::Database::mysql, but also any other library or package that
> handles its own database connections), and
>
>     b) ensuring that any existing database tables are converted
> correctly (see:
> https://dev.mysql.com/doc/refman/5.7/en/alter-table.html#alter-table-character-set
> )
>
> 2. Make sure the connection to the database uses a Unicode encoding;
> for example:
>
>     * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database/mysql.pm#L242
>
>     * https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Database.pm#L164
>
>     * https://dev.mysql.com/doc/refman/5.7/en/mysql-command-options.html#option_mysql_default-character-set
>
> 3. Make sure EPrints/Perl handles Unicode strings correctly and
> consistently.  It's a bit of a pain, but we're working at it!
>
> Cheers
>
>
> On 3 March 2018 at 10:53, David R Newman <drn at ecs.soton.ac.uk> wrote:
>> Hi Avi,
>>
>> I have noted this issue happening quite a lot as well.  I have tracked it
>> down to a problem indexing PDF documents where the extracted word to be
>> indexed contains non-ASCII characters.  If the whole word consists of
>> non-ASCII characters, effectively the empty string gets indexed.  If there
>> is more than one such word, indexing fails with the error you see below, as
>> it cannot index the empty string twice for the same EPrint and field (i.e.
>> documents).  This is because the eprint__rindex table has three fields that
>> make up its primary key: field, word and eprintid.  As the middle one is
>> empty, you see documents--91 rather than something like documents-word-91
>> in your error message.
>>
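
To illustrate the collision described above, here is a minimal Python 
sketch that uses an in-memory SQLite table as a stand-in for the MySQL 
eprint__rindex table.  The table name rindex, the helper names and the 
sample words are all made up, and ascii_only is only an assumption about 
how the non-ASCII characters end up being lost; it is not the actual 
EPrints code path.

```python
import sqlite3


def ascii_only(word: str) -> str:
    """Drop every non-ASCII character (assumed stand-in for the lossy step)."""
    return "".join(ch for ch in word if ord(ch) < 128)


def make_table() -> sqlite3.Connection:
    """Create a table with the same three-column primary key shape."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rindex ("
        " field TEXT NOT NULL,"
        " word TEXT NOT NULL,"
        " eprintid INTEGER NOT NULL,"
        " PRIMARY KEY (field, word, eprintid))"
    )
    return conn


def index_words(conn, eprintid, field, words):
    """Insert one row per word; return the words rejected as duplicates."""
    rejected = []
    for word in words:
        try:
            conn.execute(
                "INSERT INTO rindex (field, word, eprintid) VALUES (?, ?, ?)",
                (field, ascii_only(word), eprintid),
            )
        except sqlite3.IntegrityError:
            # Same (field, word, eprintid) triple already present.
            rejected.append(word)
    return rejected


if __name__ == "__main__":
    conn = make_table()
    words = ["report", "\u0434\u0430\u043d\u043d\u044b\u0435", "\u7d22\u5f15"]
    rejected = index_words(conn, 91, "documents", words)
    # !a prints an ASCII-safe representation of the rejected words.
    print(f"rejected: {rejected!a}")
```

The first all-non-ASCII word is stored as the empty string, and the 
second one then collides with it on the key (documents, '', 91), which 
corresponds to the documents--91 in the error message.
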
>> As far as I can tell, this only prevents the one badly encoded word from
>> being indexed rather than preventing all indexing for the whole EPrint.  I
>> have tested this by writing a script to completely de-index an EPrint and
>> then running reindex.  I could see the records disappear from the
>> eprint__rindex table and then reappear after the reindex.
>>
>> I am going to see if I can get the encoding issue sorted out, as this is
>> likely to be problematic for people who are indexing publications with
>> non-Latin alphabets.  However, this is never straightforward, based on past
>> experience.
>>
>> Regards
>>
>> David Newman
>>
>>
>> On 02/03/2018 10:53, Stenger, Avischai wrote:
>>
>>
>> Hello to all,
>>
>> I have some eprints that do not get rindexed. If I execute, as an example:
>>
>> ~/bin/epadmin reindex REPO eprint 91
>>
>> I get the error:
>>
>> DBD::mysql::st execute failed: Duplicate entry 'documents--91' for key
>> 'PRIMARY' at /usr/share/eprints/bin/../perl_lib/EPrints/Database.pm line
>> 1287.
>>
>>
>>
>> I noticed that if I replace the PDF document in this eprint, I can index
>> it without any error message.
>>
>> If I check the PDF with an online PDF checker, it says the PDF is okay.
>> (https://www.pdf-online.com/osa/validate.aspx)
>>
>>
>> Thanks, and have a good weekend
>>
>>
>> Avi
>>
>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>> *** Archive: http://www.eprints.org/tech.php/
>> *** EPrints community wiki: http://wiki.eprints.org/
>> *** EPrints developers Forum: http://forum.eprints.org/
>
>