EPrints Technical Mailing List Archive
Message: #00495
[EP-tech] Re: Garbage indexing some pdf
- To: "eprints-tech@ecs.soton.ac.uk" <eprints-tech@ecs.soton.ac.uk>
- Subject: [EP-tech] Re: Garbage indexing some pdf
- From: Matthew Kerwin <matthew.kerwin@qut.edu.au>
- Date: Fri, 4 May 2012 09:41:31 +1000
Is there a reason we don't ignore the $EPrints::Index::FREETEXT_SEPERATOR_CHARS hash altogether, and just set:

  $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = qr/[^\p{L}\p{N}']/;

?

Note, I based this pattern on the comment:

  # Chars which seperate words. Pretty much anything except
  # A-Z a-z 0-9 and single quote '

Also, separate/separator is spelled wrong. As long as everyone's aware how awkward this makes me feel while working in this part of the codebase.

Cheers,
Matty

-----Original Message-----
From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Paolo Tealdi
Sent: Thursday, 3 May 2012 18:02
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Re: Garbage indexing some pdf

On 04/27/2012 02:02 PM, rchilliard@mun.ca wrote:

Hi,

regarding your p.s., we noticed many records on our repository with badly indexed words (containing the non-breaking space character or similar).

Here is a (very) dirty quick patch for Tokenizer.pm that adds the most frequent breaking characters found in our fulltext.

Index: Tokenizer.pm
===================================================================
--- Tokenizer.pm (revision 323)
+++ Tokenizer.pm (working copy)
@@ -259,8 +259,63 @@
 '.' => 1, '/' => 1, ':' => 1, ';' => 1, '{' => 1, '<' => 1, '|' => 1, '=' => 1, '}' => 1, '>' => 1, '~' => 1, '?' => 1,
- chr(0xb4) => 1, chr(0x27)=>1, '{' => 1, '}' => 1 # Acute Accent (closing quote)
+ chr(0xb4) => 1, chr(0x27)=>1, '{' => 1, '}' => 1,
+chr(0x81) => 1,
+chr(0x83) => 1,
+chr(0x00a0) => 1,
+chr(0x0090) => 1,
+chr(0x0099) => 1,
+chr(0x009c) => 1,
+chr(0x009d) => 1,
+chr(0x02B9) => 1, # ca b9 MODIFIER LETTER PRIME
+chr(0x02BA) => 1, # ca ba MODIFIER LETTER DOUBLE PRIME
+chr(0x02BB) => 1, # ca bb MODIFIER LETTER TURNED COMMA
+chr(0x02BC) => 1, # ca bc MODIFIER LETTER APOSTROPHE
+chr(0x02BD) => 1, # ca bd MODIFIER LETTER REVERSED COMMA
+chr(0x02BE) => 1, # ca be MODIFIER LETTER RIGHT HALF RING
+chr(0x02BF) => 1, # ca bf MODIFIER LETTER LEFT HALF RING
+chr(0x2000) => 1, # e2 80 80 EN QUAD
+chr(0x2001) => 1, # e2 80 81 EM QUAD
+chr(0x2002) => 1, # e2 80 82 EN SPACE
+chr(0x2003) => 1, # e2 80 83 EM SPACE
+chr(0x2004) => 1, # e2 80 84 THREE-PER-EM SPACE
+chr(0x2005) => 1, # e2 80 85 FOUR-PER-EM SPACE
+chr(0x2006) => 1, # e2 80 86 SIX-PER-EM SPACE
+chr(0x2007) => 1, # e2 80 87 FIGURE SPACE
+chr(0x2008) => 1, # e2 80 88 PUNCTUATION SPACE
+chr(0x2009) => 1, # e2 80 89 THIN SPACE
+chr(0x200A) => 1, # e2 80 8a HAIR SPACE
+chr(0x200B) => 1, # e2 80 8b ZERO WIDTH SPACE
+chr(0x2024) => 1, # e2 80 a4 ONE DOT LEADER
+chr(0x2025) => 1, # e2 80 a5 TWO DOT LEADER
+chr(0x2026) => 1, # e2 80 a6 HORIZONTAL ELLIPSIS
+chr(0x2027) => 1, # e2 80 a7 HYPHENATION POINT
+chr(0x2028) => 1, # e2 80 a8 LINE SEPARATOR
+chr(0x2029) => 1, # e2 80 a9 PARAGRAPH SEPARATOR
+chr(0x2018) => 1, # e2 80 98 LEFT SINGLE QUOTATION MARK
+chr(0x2019) => 1, # e2 80 99 RIGHT SINGLE QUOTATION MARK
+chr(0x201c) => 1, # e2 80 9c LEFT DOUBLE QUOTATION MARK
+chr(0x201d) => 1, # e2 80 9d RIGHT DOUBLE QUOTATION MARK
+chr(0x2010) => 1, # e2 80 90 HYPHEN
+chr(0x2011) => 1, # e2 80 91 NON-BREAKING HYPHEN
+chr(0x2012) => 1, # e2 80 92 FIGURE DASH
+chr(0x2013) => 1, # e2 80 93 EN DASH
+chr(0x2014) => 1, # e2 80 94 EM DASH
+chr(0x2015) => 1, # e2 80 95 HORIZONTAL BAR
+chr(0xFB00) => 1, # ef ac 80 LATIN SMALL LIGATURE FF
+chr(0xFB01) => 1, # ef ac 81 LATIN SMALL LIGATURE FI
+chr(0xFB02) => 1, # ef ac 82 LATIN SMALL LIGATURE FL
+chr(0xFB03) => 1, # ef ac 83 LATIN SMALL LIGATURE FFI
+chr(0xFB04) => 1, # ef ac 84 LATIN SMALL LIGATURE FFL
+chr(0xFB05) => 1, # ef ac 85 LATIN SMALL LIGATURE LONG S T
+chr(0xFB06) => 1, # ef ac 86 LATIN SMALL LIGATURE ST
+chr(0xFFF9) => 1, # ef bf b9 INTERLINEAR ANNOTATION ANCHOR
+chr(0xFFFA) => 1, # ef bf ba INTERLINEAR ANNOTATION SEPARATOR
+chr(0xFFFB) => 1, # ef bf bb INTERLINEAR ANNOTATION TERMINATOR
+chr(0xFFFC) => 1, # ef bf bc OBJECT REPLACEMENT CHARACTER
+chr(0xFFFD) => 1 # ef bf bd REPLACEMENT CHARACTER
 };
+
 $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = quotemeta(join "", keys %$EPrints::Index::FREETEXT_SEPERATOR_CHARS);
 $EPrints::Index::FREETEXT_SEPERATOR_REGEXP = qr/[$EPrints::Index::FREETEXT_SEPERATOR_REGEXP\x00-\x20]/;

Best regards,
Paolo

> Hi Paolo,
>
> I took a quick peek at the sample that you were able to provide, and it looks like the character mapping is missing for the content text. If you export the PDF to text via Acrobat or equivalent, a hex editor shows that the output text file has all characters mapped to ascii(0x2e). With a vanilla run of pdftotext (e.g. pdftotext test.pdf test.txt) the characters are mapped to ascii(0x20), and in Unicode output from pdftotext (the command run by the indexer ~= pdftotext -enc UTF-8 test.pdf test_utf.txt) you get the byte sequence "ef 80 bd" for each character.
>
> It may be possible to retroactively reconstitute the mapping information, but I'm not aware of a mechanism to perform that operation. As well, it appears that this might have been done purposely when the PDF was generated: most tellingly, the licensing / attribution information at the conclusion of the file is mapped properly.
>
> p.s. thank you for the note / query on testing the indexed word lengths, it has notified us of a potential issue in our repository (and possibly others'?) whereby multiple words are being indexed in clusters because they are not tokenized on non-breaking space (' ') characters.
>

--
Ing. Paolo Tealdi
Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi, 24 - 10129 Torino - ITALY
Skype : tealdi.paolo

*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/
- References:
- [EP-tech] Garbage indexing some pdf
- From: Paolo Tealdi <paolo.tealdi@polito.it>
- [EP-tech] Re: Garbage indexing some pdf
- From: "Manojlovich, Slavko" <slavko@mun.ca>
- [EP-tech] Re: Garbage indexing some pdf
- From: Paolo Tealdi <paolo.tealdi@polito.it>
- [EP-tech] Re: Garbage indexing some pdf
- From: Paolo Tealdi <paolo.tealdi@polito.it>
- [EP-tech] Re: Garbage indexing some pdf
- From: <rchilliard@mun.ca>
- [EP-tech] Re: Garbage indexing some pdf
- From: Paolo Tealdi <paolo.tealdi@polito.it>
- [EP-tech] Garbage indexing some pdf