[EP-tech] Re: Search Index Troubles



Hi Tim,

   Looks good; works as expected for us here. '\w' also seems to include underscore and a few other word-like-chars, which is quite handy in this case.
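
For anyone who wants to see the behaviour in isolation, here is a minimal sketch in Python (not EPrints code). It assumes the change amounts to treating maximal runs of '\w' characters as tokens, which I have not checked against the changeset itself; the function name and sample string are made up for illustration. Python's '\w' and Perl's '\w' may differ in corner cases, so treat this as an approximation only.

```python
import re

# Assumption: a token is a maximal run of "word" characters.
# For str patterns, Python's \w covers Unicode letters and digits plus underscore.
WORD_RUN = re.compile(r"\w+")


def split_on_non_word(text):
    """Return the runs of word characters found in text (hypothetical helper)."""
    return WORD_RUN.findall(text)


# NO-BREAK SPACE, LINE SEPARATOR and PARAGRAPH SEPARATOR sit between the words.
sample = "search_index\u00a0troubles\u2028caf\u00e9\u2029na\u00efve"
tokens = split_on_non_word(sample)
print(tokens)
```

The underscore stays inside 'search_index', the accented words survive intact, and the three formatting characters act as token boundaries.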

Cheers,
Casey

-----Original Message-----
From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Tim Brody
Sent: May-01-12 12:27 PM
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Re: Search Index Troubles

Hi,

Can you try this change:
http://trac.eprints.org/eprints/changeset/7669

(\w matches any Unicode word character: letters, digits and underscore)

/Tim.

On Mon, 2012-04-30 at 18:38 +0000, rchilliard at mun.ca wrote:
> Hi All,
> 
>  
> 
>    Over the last few days, we've been sorting out a few kinks with 
> fulltext searching / index creation on our local EPrints 
> repository and thought I'd pass along the notes in the hopes that it 
> might help out others. The issues came to light when we ran the query 
> suggested by Paolo Tealdi a few days back to find malformed content in 
> the eprint index table:
> 
>  
> 
> select *,length(word) from eprint__rindex where length(word) > 35
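
As a rough illustration of what that query flags, the following Python sketch runs it against an in-memory SQLite table. The three-column layout and the sample rows are assumptions made for the example, not the real EPrints schema, and the real table lives in the repository's own database rather than SQLite.

```python
import sqlite3

# Hypothetical stand-in for the real index table; the columns are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE eprint__rindex (eprintid INTEGER, field TEXT, word TEXT)"
)
rows = [
    (1, "documents", "repository"),
    # Four valid words glued together by Unicode formatting characters.
    (1, "documents", "institutional\u00a0repository\u2028fulltext\u2029indexing"),
    (2, "title", "index"),
]
conn.executemany("INSERT INTO eprint__rindex VALUES (?, ?, ?)", rows)

# Same shape as the query quoted above.
suspects = conn.execute(
    "SELECT *, length(word) FROM eprint__rindex WHERE length(word) > 35"
).fetchall()

for eprintid, field, word, word_length in suspects:
    print(eprintid, field, word_length, repr(word))
```

One caveat: SQLite's length() counts characters for text values, whereas MySQL's LENGTH() counts bytes (CHAR_LENGTH() counts characters), so on MySQL a word with many multi-byte characters can cross the threshold at fewer than 35 characters.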
> 
>  
> 
> In our local results we noted a number of 'word' values corresponding 
> to eprints with pdf documents in which series of valid words were 
> strung together with assorted Unicode characters interspersed.
> 
>  
> 
> The troublesome Unicode characters were inserted during the export 
> from pdf to text that eprints calls to generate the source fulltext 
> to be indexed (called as '$(pdftotext) -enc UTF-8 
> -layout $(SOURCE) $(TARGET)'). Owing to the '-layout' argument, many 
> spaces, line endings and paragraph endings were converted to Unicode 
> formatting characters not handled by the default tokenizer (e.g. space 
> to 'NO-BREAK SPACE' - "chr(0xa0)", line ending to 'LINE SEPARATOR' - 
> "\x{2028}" and paragraph ending to 'PARAGRAPH SEPARATOR' - 
> "\x{2029}").
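
The effect described above can be sketched in a few lines of Python (again, not EPrints code). The base separator set below is a made-up stand-in for a fixed delimiter list, and the helper name is hypothetical; only the three extra code points come from the message.

```python
# Hypothetical stand-in for a fixed delimiter list; not the EPrints list.
BASE_SEPARATORS = set(" \t\r\n.,;:!?()[]\"'-/")

# NO-BREAK SPACE, LINE SEPARATOR, PARAGRAPH SEPARATOR.
EXTRA_SEPARATORS = {"\u00a0", "\u2028", "\u2029"}


def tokenize(text, separators):
    """Split text wherever a character from `separators` occurs."""
    tokens, current = [], []
    for ch in text:
        if ch in separators:
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


sample = "institutional\u00a0repository\u2028fulltext\u2029indexing"

# Without the extra code points the whole string comes out as one long "word".
before = tokenize(sample, BASE_SEPARATORS)
# With them added, the four real words are recovered.
after = tokenize(sample, BASE_SEPARATORS | EXTRA_SEPARATORS)

print(len(before), [len(t) for t in before])
print(after)
```

A fixed list only catches the characters someone thought to add, which is the trade-off against matching on a character class instead.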
> 
>  
> 
> These are easy to identify and add to the list of delimiters. 
> However, it seems that the list of delimiters 
> ('FREETEXT_SEPERATOR_CHARS') is defined in both 
> ~eprints/archives/{archiveid}/cfg/cfg.d/indexing.pl and 
> ~eprints/perl_lib/EPrints/Index/Tokenizer.pm, only the latter of which 
> appears to have any effect. (The former may be orphaned code specific 
> to our repository.)
> 
>  
> 
> Also of note: in our case, resetting the indexed values 
> seemed to require reloading the config (restarting apache and the 
> indexer, to pick up the updated Tokenizer.pm) as well as dropping the 
> contents of the eprint__rindex table, all before finally running 
> epadmin erase_fulltext_index. To any who might be having their search 
> misbehave, hopefully this may be of some help. Any warnings, 
> criticisms or comments welcome!
> 
>  
> 
> NB: as our config could differ significantly from those out there, it 
> might be best to test the above on a non-critical / test repository if 
> it is of interest to you.
> 
>  
> 
> Cheers,
> 
> Casey
> 
>  
> 
> Casey Hilliard
> 
> PC Consultant,
> 
> Health Sciences Library / QE2 Systems,
> 
> Memorial University
> 
> Phone: 709-777-2387 (HSL)
> 
> Phone: 709-864-6267 (QE2)
> 
> 
> This electronic communication is governed by the terms and conditions 
> at 
> http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2012.php
> *** Options: 
> http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/

