
[EP-tech] Search Index Troubles

Hi All,

   Over the last few days, we've been sorting out a few kinks with fulltext searching / index creation on our local EPrints repository, and I thought I'd pass along the notes in the hope that they might help others. The issues came to light when we ran the query posted by Paolo Tealdi a few days back to look for malformed content in the eprint index table:

select *,length(word) from eprint__rindex where length(word) > 35

In our local results we found a number of 'word' values corresponding to eprints with PDF documents, in which series of valid words were strung together with assorted Unicode characters interspersed.
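For anyone wanting to see which characters are hiding inside such overlong 'word' values, here is a minimal Python sketch. It is not part of EPrints; the function name and the sample word are made up for illustration, and it assumes the values have already been fetched from the table as Unicode strings.

```python
import unicodedata


def suspicious_characters(word):
    """Return (code point, Unicode name) pairs for non-ASCII characters
    in `word` that are neither letters nor digits."""
    found = []
    for ch in word:
        if ord(ch) > 127 and not ch.isalnum():
            # unicodedata.name() falls back to "UNNAMED" for unnamed code points
            found.append((f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")))
    return found


# Hypothetical example of a malformed index entry
word = "health\u00a0sciences\u2028library"
for code, name in suspicious_characters(word):
    print(code, name)
```

Running something like this over the results of the query above should give a short list of the characters that need to be treated as delimiters.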

The troublesome Unicode characters were inserted during the export from PDF to text, which EPrints calls to generate the source fulltext to be indexed (invoked as '$(pdftotext) -enc UTF-8 -layout $(SOURCE) $(TARGET)'). Owing to the '-layout' argument, many spaces, line endings and paragraph endings were converted to Unicode formatting characters not handled by the default tokenizer (e.g. space to 'NO-BREAK SPACE' - "chr(0xa0)", line ending to 'LINE SEPARATOR' - "\x{2028}" and paragraph ending to 'PARAGRAPH SEPARATOR' - "\x{2029}").
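To illustrate the effect, here is a rough Python sketch of a delimiter-based tokenizer. It is not the EPrints tokenizer; the two separator lists and the sample text are assumptions made up for the example. With only the basic separators, the whole run survives as one overlong "word"; once the three Unicode characters are added, it splits as expected.

```python
import re

# Hypothetical delimiter sets for illustration; not the actual EPrints list.
BASIC_SEPARATORS = " \t\n\r.,;:!?"
EXTRA_SEPARATORS = "\u00a0\u2028\u2029"  # NO-BREAK SPACE, LINE SEPARATOR, PARAGRAPH SEPARATOR


def tokenize(text, separators):
    """Split `text` on any run of characters listed in `separators`."""
    pattern = "[" + re.escape(separators) + "]+"
    return [token for token in re.split(pattern, text) if token]


# Words joined by the characters that the text extraction left behind
sample = "full\u00a0text\u2028search\u2029index"

print(tokenize(sample, BASIC_SEPARATORS))                     # one long token
print(tokenize(sample, BASIC_SEPARATORS + EXTRA_SEPARATORS))  # four separate words
```

Note that the separators are listed explicitly on purpose: Python's own whitespace splitting already treats these three characters as whitespace, which would hide the problem.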

These are easy to identify and add to the list of delimiters. However, the list of delimiters ('FREETEXT_SEPERATOR_CHARS') appears to be defined in both ~eprints/archives/{archiveid}/cfg/cfg.d/indexing.pl and ~eprints/perl_lib/EPrints/Index/Tokenizer.pm, and only the latter seems to have any effect. (The former may be orphaned code specific to our repository.)

Also of note: in our case, resetting the indexed values seemed to require reloading the config (restarting apache and the indexer, so that the updated Tokenizer.pm is picked up) and dropping the contents of the eprint__rindex table, all before finally running epadmin erase_fulltext_index. To anyone whose search is misbehaving, hopefully this is of some help. Any warnings, criticisms or comments are welcome!

NB: as our config could differ significantly from others out there, it would be best to try the above on a non-critical / test repository first if it is of interest to you.


Casey Hilliard
PC Consultant,
Health Sciences Library / QE2 Systems,
Memorial University
Phone: 709-777-2387 (HSL)
Phone: 709-864-6267 (QE2)
