EPrints Technical Mailing List Archive

Message: #00495

[EP-tech] Re: Garbage indexing some pdf

To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
Subject: [EP-tech] Re: Garbage indexing some pdf
From: Matthew Kerwin <matthew.kerwin@qut.edu.au>
Date: Fri, 4 May 2012 09:41:31 +1000

Is there a reason we don't ignore the $EPrints::Index::FREETEXT_SEPERATOR_CHARS hash altogether, and just set:
  $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = qr/[^\p{L}\p{N}']/;
?  Note, I based this pattern on the comment:
  # Chars which seperate words. Pretty much anything except
  # A-Z a-z 0-9 and single quote '
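
For illustration, the proposed pattern can be tried on its own. This is only a sketch, not the actual EPrints tokenizer: split_words is a hypothetical helper, and the sample string is made up to include a non-breaking space and an em dash.

```perl
use strict;
use warnings;

# Assumption: a separator is any character that is not a Unicode letter,
# a Unicode number or an ASCII apostrophe (the pattern proposed above).
my $separator = qr/[^\p{L}\p{N}']/;

# Hypothetical helper: split text on runs of separators, drop empty tokens.
sub split_words {
    my ($text) = @_;
    return grep { length } split /$separator+/, $text;
}

# U+00A0 is NO-BREAK SPACE, U+2014 is EM DASH.
my $sample = "full\x{00a0}text indexing\x{2014}don't break";
print join("|", split_words($sample)), "\n";
# prints: full|text|indexing|don't|break
```

One difference from the hash approach: the ligatures U+FB00 to U+FB06 and the modifier letters U+02B9 to U+02BF are letters as far as \p{L} is concerned, so this pattern keeps them inside words instead of splitting on them.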

Also, separate/separator is spelled wrong. As long as everyone's aware how awkward this makes me feel while working in this part of the codebase.

Cheers,
Matty

-----Original Message-----
From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Paolo Tealdi
Sent: Thursday, 3 May 2012 18:02
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Re: Garbage indexing some pdf

On 04/27/2012 02:02 PM, rchilliard@mun.ca wrote:
Hi,

Regarding your p.s., we noticed many records in our repository with
badly indexed words (containing non-breaking space characters or other
similar stuff).
Here is a (very) dirty quick patch for Tokenizer.pm that adds the most
frequent breaking characters found in our full text.

Index: Tokenizer.pm
===================================================================
--- Tokenizer.pm    (revision 323)
+++ Tokenizer.pm    (working copy)
@@ -259,8 +259,63 @@
      '.' => 1,     '/' => 1,     ':' => 1,     ';' => 1,
      '{' => 1,     '<' => 1,     '|' => 1,     '=' => 1,
      '}' => 1,     '>' => 1,     '~' => 1,     '?' => 1,
-    chr(0xb4) => 1, chr(0x27)=>1,   '{' => 1,       '}' => 1  # Acute Accent (closing quote)
+    chr(0xb4) => 1, chr(0x27)=>1,   '{' => 1,       '}' => 1,
+chr(0x81) => 1,
+chr(0x83) => 1,
+chr(0x00a0) => 1,
+chr(0x0090) => 1,
+chr(0x0099)  => 1,
+chr(0x009c)  => 1,
+chr(0x009d) => 1,
+chr(0x02B9) => 1, # ca b9    MODIFIER LETTER PRIME
+chr(0x02BA) => 1, # ca ba    MODIFIER LETTER DOUBLE PRIME
+chr(0x02BB) => 1, # ca bb    MODIFIER LETTER TURNED COMMA
+chr(0x02BC) => 1, # ca bc       MODIFIER LETTER APOSTROPHE
+chr(0x02BD) => 1, # ca bd    MODIFIER LETTER REVERSED COMMA
+chr(0x02BE) => 1, # ca be    MODIFIER LETTER RIGHT HALF RING
+chr(0x02BF) => 1, # ca bf    MODIFIER LETTER LEFT HALF RING
+chr(0x2000) => 1, # e2 80 80    EN QUAD
+chr(0x2001) => 1, # e2 80 81    EM QUAD
+chr(0x2002) => 1, # e2 80 82    EN SPACE
+chr(0x2003) => 1, # e2 80 83    EM SPACE
+chr(0x2004) => 1, # e2 80 84    THREE-PER-EM SPACE
+chr(0x2005) => 1, # e2 80 85    FOUR-PER-EM SPACE
+chr(0x2006) => 1, # e2 80 86    SIX-PER-EM SPACE
+chr(0x2007) => 1, # e2 80 87    FIGURE SPACE
+chr(0x2008) => 1, # e2 80 88    PUNCTUATION SPACE
+chr(0x2009) => 1, # e2 80 89    THIN SPACE
+chr(0x200A) => 1, # e2 80 8a    HAIR SPACE
+chr(0x200B) => 1, # e2 80 8b    ZERO WIDTH SPACE
+chr(0x2024)  => 1, # e2 80 a4    ONE DOT LEADER
+chr(0x2025)  => 1, # e2 80 a5   TWO DOT LEADER
+chr(0x2026)  => 1, # e2 80 a6   HORIZONTAL ELLIPSIS
+chr(0x2027)  => 1, # e2 80 a7   HYPHENATION POINT
+chr(0x2028)  => 1, # e2 80 a8   LINE SEPARATOR
+chr(0x2029)  => 1, # e2 80 a9   PARAGRAPH SEPARATOR
+chr(0x2018) => 1,  # e2 80 98    LEFT SINGLE QUOTATION MARK
+chr(0x2019) => 1, # e2 80 99    RIGHT SINGLE QUOTATION MARK
+chr(0x201c) => 1, # e2 80 9c    LEFT DOUBLE QUOTATION MARK
+chr(0x201d) => 1,  # e2 80 9d    RIGHT DOUBLE QUOTATION MARK
+chr(0x2010) => 1,  # e2 80 90    HYPHEN
+chr(0x2011) => 1,  # e2 80 91    NON-BREAKING HYPHEN
+chr(0x2012) => 1,  # e2 80 92    FIGURE DASH
+chr(0x2013) => 1,  # e2 80 93    EN DASH
+chr(0x2014) => 1,  # e2 80 94    EM DASH
+chr(0x2015) => 1,  # e2 80 95    HORIZONTAL BAR
+chr(0xFB00) => 1,  #ef ac 80    LATIN SMALL LIGATURE FF
+chr(0xFB01) => 1,  #ef ac 81    LATIN SMALL LIGATURE FI
+chr(0xFB02) => 1,  #ef ac 82    LATIN SMALL LIGATURE FL
+chr(0xFB03) => 1,  #ef ac 83    LATIN SMALL LIGATURE FFI
+chr(0xFB04) => 1,  #ef ac 84    LATIN SMALL LIGATURE FFL
+chr(0xFB05) => 1,  #ef ac 85    LATIN SMALL LIGATURE LONG S T
+chr(0xFB06) => 1,  #ef ac 86    LATIN SMALL LIGATURE ST
+chr(0xFFF9 ) => 1,  #ef bf b9 INTERLINEAR ANNOTATION ANCHOR
+chr(0xFFFA ) => 1,  #ef bf ba INTERLINEAR ANNOTATION SEPARATOR
+chr(0xFFFB ) => 1,  #ef bf bb INTERLINEAR ANNOTATION TERMINATOR
+chr(0xFFFC ) => 1,  #ef bf bc OBJECT REPLACEMENT CHARACTER
+chr(0xFFFD ) => 1  #ef bf bd REPLACEMENT CHARACTER
  };
+
  $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = quotemeta(join "", keys %$EPrints::Index::FREETEXT_SEPERATOR_CHARS);
  $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = qr/[$EPrints::Index::FREETEXT_SEPERATOR_REGEXP\x00-\x20]/;

Best regards,
Paolo
> Hi Paolo,
>
>     I took a quick peek at the sample that you were able to provide, and it looks like the character mapping is missing for the content text. If you export the PDF to text via Acrobat or an equivalent, a hex editor shows that every character in the output text file is mapped to ASCII 0x2e. With a vanilla run of pdftotext (e.g. pdftotext test.pdf test.txt) the characters are mapped to ASCII 0x20, and with UTF-8 output from pdftotext (the command run by the indexer is roughly pdftotext -enc UTF-8 test.pdf test_utf.txt) you get the byte sequence "ef 80 bd" for each character.
>
>     It may be possible to retroactively reconstitute the mapping information, but I'm not aware of a mechanism to perform that operation. It also appears that this might have been done purposely when the PDF was generated: most tellingly, the licensing / attribution information at the end of the file is mapped properly.
>
> p.s. Thank you for the note / query on testing the indexed word lengths. It has alerted us to a potential issue in our repository (and possibly others'?) whereby multiple words are indexed as a single cluster because they are not tokenized on non-breaking space ('&nbsp;') characters.
>
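
The byte sequence "ef 80 bd" quoted above can be checked with a few lines of Perl. This is a sketch only: describe_utf8_bytes is a hypothetical helper, and nothing here reproduces what the indexer itself does. It decodes the three bytes as UTF-8 and tests whether the result is a private-use code point.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode raw UTF-8 bytes holding one character and
# report its code point and whether it is in a Unicode private-use area.
sub describe_utf8_bytes {
    my ($bytes) = @_;
    my $char = decode('UTF-8', $bytes);
    return {
        codepoint   => sprintf("U+%04X", ord($char)),
        private_use => ($char =~ /\p{Co}/ ? 1 : 0),   # Co = Private_Use
    };
}

my $info = describe_utf8_bytes("\xef\x80\xbd");
printf "%s private-use: %s\n",
    $info->{codepoint}, $info->{private_use} ? "yes" : "no";
# prints: U+F03D private-use: yes
```

A private-use code point carries no standard meaning, which is consistent with the observation that the character mapping is missing for the content text.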

-- 
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo
Please consider your environmental responsibility before printing this e-mail

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
