[EP-tech] Garbage indexing some pdf

Paolo Tealdi paolo.tealdi at polito.it
Thu Apr 26 11:39:05 BST 2012


Dear all,

we found that some PDFs can't be parsed properly by pdftotext, which
produces text completely full of garbage.
It's not a permission problem (PDF locked against copy&paste); more probably
it is an encoding problem: opening them with acrobat/xpdf, I can see
fonts with encoding type "embedded".
Has anybody come across this type of PDF file? Has anybody solved it?

We can find them simply with this select :

select *,length(word) from eprint__rindex where length(word) > 35

Attached is a screenshot with some of the garbage eprint__rindex records ...

This select will also find PDFs that use high Unicode characters: a
whole new world will open up to you :-D
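
One possible way to separate the two cases, as a minimal sketch only: it
assumes the database is MySQL, where length() counts bytes rather than
characters, so a perfectly good word in a multi-byte script can cross the
35 threshold while having far fewer characters. The function name
classify_word and the labels it returns are hypothetical, and the result is
just a hint, not a reliable garbage detector.

```python
THRESHOLD = 35  # same cut-off as in the select above


def classify_word(word: str, threshold: int = THRESHOLD) -> str:
    """Give a rough label to an indexed word (hypothetical helper)."""
    byte_length = len(word.encode("utf-8"))  # what a byte-counting length() sees
    char_length = len(word)                  # number of Unicode code points

    if byte_length <= threshold:
        return "ok"          # would not be returned by the select at all
    if char_length <= threshold:
        return "multibyte"   # long in bytes only: probably real non-ASCII text
    return "suspect"         # long in characters too: worth checking by hand
```

Words labelled "suspect" are the ones to compare against the original PDF;
words labelled "multibyte" are more likely the high Unicode case.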

Best regards,
Paolo Tealdi

-- 
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo
Please consider your environmental responsibility before printing this e-mail

-------------- next part --------------
A non-text attachment was scrubbed...
Name: example.odt
Type: application/vnd.oasis.opendocument.text
Size: 343105 bytes
Desc: not available
Url : http://mailman.ecs.soton.ac.uk/pipermail/eprints-tech/attachments/20120426/3aae31a5/attachment-0001.odt 

