[EP-tech] A specific eprint doesn't get indexed

David R Newman drn at ecs.soton.ac.uk
Sat Mar 3 00:53:20 GMT 2018


Hi Avi,

I have noticed this issue happening quite a lot as well.  I have tracked 
it down to a problem indexing PDF documents where an extracted word to 
be indexed contains non-ASCII characters.  If the whole word consists of 
non-ASCII characters, effectively the empty string gets indexed.  If 
more than one word consists entirely of non-ASCII characters, indexing 
fails with the error you see below, because the empty string cannot be 
indexed twice for the same EPrint and field (i.e. documents).  This is 
because the eprint__rindex table has three fields that make up its 
primary key: field, word and eprintid.  As the middle one is not set 
(it is the empty string), you see documents--91 rather than something 
like documents-word-91 in your error message.
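
To illustrate the collision described above, here is a minimal Python 
sketch.  It is not EPrints code: the in-memory SQLite table, the 
strip_non_ascii helper and the sample words are all assumptions made 
for illustration, and only the three key columns (field, word, 
eprintid) come from the description above.

```python
import sqlite3


def strip_non_ascii(word):
    """Hypothetical stand-in for the lossy step: drop every non-ASCII character."""
    return word.encode("ascii", "ignore").decode("ascii")


def index_words(conn, field, eprintid, words):
    """Insert one row per extracted word; return the keys rejected as duplicates."""
    rejected = []
    for word in words:
        key = (field, strip_non_ascii(word), eprintid)
        try:
            conn.execute(
                "INSERT INTO rindex (field, word, eprintid) VALUES (?, ?, ?)", key
            )
        except sqlite3.IntegrityError:
            # Same (field, word, eprintid) already present: primary key clash.
            rejected.append("-".join(str(part) for part in key))
    return rejected


# Illustrative table with the same three-column primary key (assumed layout).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rindex ("
    " field TEXT NOT NULL, word TEXT NOT NULL, eprintid INTEGER NOT NULL,"
    " PRIMARY KEY (field, word, eprintid))"
)

# Two different all-non-ASCII words both collapse to the empty string.
rejected = index_words(conn, "documents", 91, ["repository", "данные", "архив"])
indexed = conn.execute("SELECT word FROM rindex ORDER BY word").fetchall()
print(rejected)
print(indexed)
```

In this sketch the first all-non-ASCII word is stored as an empty 
string and the second one is rejected with the key documents--91, 
while the ordinary word is indexed normally.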

As far as I can tell, this only prevents the one badly encoded word from 
being indexed, rather than preventing all indexing for the whole 
EPrint.  I tested this by writing a script to completely de-index an 
EPrint and then running reindex.  I could see the records disappear 
from the eprint__rindex table and then reappear after the reindex.

I am going to see if I can get the encoding issue sorted out, as this is 
likely to be problematic for people who are indexing publications with 
non-Latin alphabets.  However, this is never straightforward, based on 
past experience.

Regards

David Newman

On 02/03/2018 10:53, Stenger, Avischai wrote:
>
> Hello to all,
>
> I have some eprints that do not get reindexed. If I execute, for example:
>
> ~/bin/epadmin reindex REPO eprint 91
>
> I get the error:
>
> DBD::mysql::st execute failed: Duplicate entry 'documents--91' for key 
> 'PRIMARY' at /usr/share/eprints/bin/../perl_lib/EPrints/Database.pm 
> line 1287.
>
>
>
> I noticed that if I replace the PDF document in this eprint, I can 
> index it without any error message.
>
> If I check the PDF with an open PDF checker, it says the PDF is okay.
> (https://www.pdf-online.com/osa/validate.aspx)
>
>
> Thanks, and have a good weekend
>
>
> Avi
>