EPrints Technical Mailing List Archive

Message: #00969



[EP-tech] Solution: Errors while indexing PDF/A files


Hi All,

   I've just solved an issue that had been cropping up with our repository and thought I'd pass along the solution we arrived at. Our setup is Ubuntu 10.04 running EPrints 3.3.7, though the issue will likely apply to most Linux-based installs.

   When re-running indexing on our eprints via the console, we noted a large number of errors as the indexer progressed through the documents computing full-text index info, e.g.:

eprints@samplreposerver:~/bin$ ./epadmin reindex samplerepo eprint
You are about to reindex "eprint" in the samplerepo repository.
This can take some time.
Number of records in set: 141
Continue [y/n] ? yes
Error: Illegal entry in bfchar block in ToUnicode CMap
Error: Illegal entry in bfchar block in ToUnicode CMap
Error: Bad annotation destination
...

We narrowed the issue down to the combination of the text extraction tool used for PDF files (pdftotext, a component of xpdf) and the particular formatting of the large number of PDF/A files in our repository. The root issue is that, as of xpdf version 3.02, abbreviated character codes for Unicode characters in the <00xx> range are treated as invalid within ToUnicode CMaps, even though they are generally consistent with the PDF/A format.
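To see which files trigger the message, one option is to run pdftotext by hand and count the CMap errors it prints. The following is only a rough sketch, assuming pdftotext is on the PATH; the function names count_cmap_errors and scan_pdf are made up for illustration and are not part of EPrints or xpdf.

```shell
#!/bin/sh
# Sketch: count the ToUnicode CMap errors pdftotext reports for one PDF.
# Assumption: pdftotext is on PATH. Function names are illustrative only.

# Read pdftotext diagnostics on stdin and print how many lines carry the
# "Illegal entry in bfchar block" error quoted in this message.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_cmap_errors() {
    grep -c 'Illegal entry in bfchar block in ToUnicode CMap' || true
}

# Run pdftotext on one file, write the extracted text to /dev/null,
# and pass only the diagnostics (stderr) to the counter.
scan_pdf() {
    pdftotext "$1" /dev/null 2>&1 | count_cmap_errors
}

# Example loop (commented out so that sourcing this file runs nothing):
# for f in *.pdf; do
#     printf '%s\t%s\n' "$(scan_pdf "$f")" "$f"
# done
```

A file that reports a non-zero count here is one that should produce the same messages during a reindex.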

The solution we settled on is simply to upgrade to the new version of xpdf (3.03, released very recently on 2012-08-15), which addresses the issue by permitting these characters in CMaps and eliminating the (false) error messages. Unfortunately, xpdf 3.03 is not yet available via the package manager for most Linux releases, so it must be installed from the tarball (available at http://www.foolabs.com/xpdf/download.html). Hopefully this proves of some help to others, though if you haven't been handling PDF/A files, you may not have noticed the error at all.
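Before building from the tarball, it may be worth checking whether the installed pdftotext is already at 3.03 or later. The sketch below is one way to do that, with some assumptions: the names FIXED_VERSION, version_lt and installed_version are invented for illustration, sort -V requires GNU coreutils, and the comparison is only meaningful if the binary is xpdf's own pdftotext (poppler-based builds report poppler's version numbering instead).

```shell
#!/bin/sh
# Sketch: decide whether the installed pdftotext predates xpdf 3.03.
# Assumptions: GNU sort (for -V) and GNU grep (for -o) are available;
# all names below are illustrative, not part of xpdf or EPrints.

FIXED_VERSION="3.03"

# Succeed (exit 0) when version $1 sorts strictly before version $2.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# pdftotext -v prints its version banner on stderr; take the first
# dotted number found on the first line of that banner.
installed_version() {
    pdftotext -v 2>&1 | head -n 1 | grep -o '[0-9][0-9.]*' | head -n 1
}

# Example use (commented out so that sourcing this file runs nothing):
# if version_lt "$(installed_version)" "$FIXED_VERSION"; then
#     echo "pdftotext is older than $FIXED_VERSION; consider upgrading"
# fi
```

If the check reports an older version, the tarball install described above is the route we took.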

Cheers,
Casey


Casey Hilliard
PC Consultant,
Health Sciences Library / QE2 Systems,
Memorial University
Phone: 709-777-2387 (HSL)
Phone: 709-864-6267 (QE2)

This electronic communication is governed by the terms and conditions at http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2011.php
