[EP-tech] Re: Garbage indexing some pdf

Manojlovich, Slavko slavko at mun.ca
Thu Apr 26 13:02:11 BST 2012


Hi
Would you please provide an example of a PDF in your repository which demonstrates this problem?
Thanks
Slavko Manojlovich
Associate University Librarian (IT)
Memorial University of Newfoundland
St. John's, Newfoundland
Canada
email: slavko at mun.ca
 

________________________________

From: eprints-tech-bounces at ecs.soton.ac.uk on behalf of Paolo Tealdi
Sent: Thu 4/26/2012 8:09 AM
To: <eprints-tech at ecs.soton.ac.uk>
Subject: [EP-tech] Garbage indexing some pdf



Dear all,

we have found that some PDFs cannot be parsed by the pdftotext function,
which produces documents completely full of garbage.
It is not a permission problem (PDF blocked for copy and paste); more
probably it is an encoding problem: opening them with Acrobat/xpdf, I can
see character fonts with encoding type "embedded".
Has anybody come across this type of PDF file? Has anybody resolved it?

We can find them simply with this select:

select *, length(word) from eprint__rindex where length(word) > 35

Attached is a screenshot showing some of the garbage eprint__rindex records ...

This select will also find PDFs that use high Unicode characters: a
whole new world will open up to you :-D
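
One possible reason for those extra hits, assuming the database is MySQL
with a UTF-8 column: length() counts bytes, not characters, so a word made
of multi-byte characters can exceed 35 while being much shorter in
characters. A minimal Python sketch of the distinction (the function names
and the threshold parameter are illustrative only, not part of EPrints):

```python
def byte_length(word: str) -> int:
    """Length in UTF-8 bytes (assumed to match what length() reports)."""
    return len(word.encode("utf-8"))


def flag_words(words, threshold=35):
    """Separate words that are long in characters from words that only
    exceed the threshold when measured in bytes."""
    long_in_chars = []
    long_in_bytes_only = []
    for word in words:
        if len(word) > threshold:
            # Long even when counting characters: likely garbage.
            long_in_chars.append(word)
        elif byte_length(word) > threshold:
            # Short in characters, long in bytes: likely multi-byte text.
            long_in_bytes_only.append(word)
    return long_in_chars, long_in_bytes_only


sample = ["a" * 40, "я" * 20, "short"]
garbage_candidates, multibyte_candidates = flag_words(sample)
```

If that assumption holds, using char_length(word) in the select instead of
length(word) should count characters and leave out most of those hits.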

Best regards,
Paolo Tealdi

--
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo



