EPrints Technical Mailing List Archive

Message: #00464



[EP-tech] Re: Search Index Troubles


Hi,

Can you try this change:
http://trac.eprints.org/eprints/changeset/7669

(\w matches any Unicode word character: letters, digits and underscore)
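
To illustrate the idea, here is a minimal sketch in Python (not the EPrints Perl source, and not the contents of the changeset): splitting on runs of non-word characters treats NO-BREAK SPACE, LINE SEPARATOR and PARAGRAPH SEPARATOR as delimiters without listing each one. The name split_words and the sample string are made up for the example.

```python
import re

def split_words(text):
    """Split text into tokens on any run of non-word characters."""
    # In Python 3, \w on a str pattern is Unicode-aware, so \W also covers
    # U+00A0, U+2028 and U+2029, not just ASCII whitespace and punctuation.
    return [tok for tok in re.split(r"\W+", text) if tok]

# Hypothetical sample: words glued together by the separators from the report.
sample = "full\u00a0text\u2028search\u2029index"
print(split_words(sample))
```

Accented letters still count as word characters under this approach, so non-English words are not split apart.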

/Tim.

On Mon, 2012-04-30 at 18:38 +0000, rchilliard@mun.ca wrote:
> Hi All, 
> 
>  
> 
>    Over the last few days, we've been sorting out a few kinks with
> fulltext searching / index creation on our local EPrints repository,
> and thought I'd pass along the notes in the hope that they might help
> others. The issues came to light when we ran the query suggested by
> Paolo Tealdi a few days back, which looks for malformed content in the
> eprint index table:
> 
>  
> 
> select *,length(word) from eprint__rindex where length(word) > 35
> 
>  
> 
> In our local results we noted a number of 'word' values corresponding
> to eprints with pdf documents in which series of valid words were
> strung together with assorted Unicode characters interspersed. 
> 
>  
> 
> The troublesome Unicode characters were inserted during the export
> from pdf to text that eprints calls to generate the source fulltext to
> be indexed (called as '$(pdftotext) -enc UTF-8 -layout $(SOURCE)
> $(TARGET)'). Owing to the '-layout' argument, many spaces, line
> endings and paragraph endings were converted to Unicode formatting
> characters not handled by the default tokenizer (e.g. space to
> 'NO-BREAK SPACE' - "chr(0xa0)", line ending to 'LINE SEPARATOR' -
> "\x{2028}" and paragraph ending to 'PARAGRAPH SEPARATOR' -
> "\x{2029}"). 
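
One way to see which characters are hiding inside an over-long 'word' is to list every character in the Unicode separator categories. This is a sketch in Python, not part of EPrints; the function name describe_separators and the sample token are assumptions for illustration.

```python
import unicodedata

def describe_separators(token):
    """Return (code point, Unicode name) for each separator character in token."""
    found = []
    for ch in token:
        # General categories Zs, Zl and Zp are the space, line and
        # paragraph separators respectively.
        if unicodedata.category(ch).startswith("Z"):
            found.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")))
    return found

# Hypothetical token resembling the malformed index entries described above.
token = "valid\u00a0words\u2028strung\u2029together"
for code, name in describe_separators(token):
    print(code, name)
```

The code points reported this way are the candidates to add to a delimiter list.
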
> 
>  
> 
> These are easy to identify and add to the list of delimiters. However,
> it seems that the list of delimiters ('FREETEXT_SEPERATOR_CHARS') is
> defined in both ~eprints/archives/{archiveid}/cfg/cfg.d/indexing.pl
> and ~eprints/perl_lib/EPrints/Index/Tokenizer.pm, only the latter of
> which appears to have any effect. (The former may be orphaned code
> specific to our repository.)
> 
>  
> 
> Also of note: in our case, resetting the indexed values seemed to
> require reloading the config (restarting apache and the indexer to
> pick up the updated Tokenizer.pm), as well as emptying the
> eprint__rindex table, all before finally running epadmin
> erase_fulltext_index. For anyone whose search is misbehaving,
> hopefully this is of some help - any warnings, criticisms or comments
> welcome! 
> 
>  
> 
> NB: as our config could differ significantly from those out there, it
> might be best to test the above on a non-critical / test repository if
> it is of interest to you.
> 
>  
> 
> Cheers,
> 
> Casey
> 
>  
> 
> Casey Hilliard
> 
> PC Consultant, 
> 
> Health Sciences Library / QE2 Systems,
> 
> Memorial University
> 
> Phone: 709-777-2387 (HSL)
> 
> Phone: 709-864-6267 (QE2)
> 
>  
> 
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
