[EP-tech] Re: Search Index Troubles
Tim Brody
tdb2 at ecs.soton.ac.uk
Tue May 1 15:56:51 BST 2012
Hi,
Can you try this change:
http://trac.eprints.org/eprints/changeset/7669
(\w matches any Unicode letter or number, plus underscore)
/Tim.
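
To illustrate the \w remark, here is a minimal sketch in Python rather
than EPrints' Perl. It does not reproduce the contents of changeset 7669;
the `tokenize` helper and the sample string are made up for the
illustration. Python's `re` treats \w as Unicode-aware for `str`
patterns, so an accented letter stays inside a token, while NO-BREAK
SPACE (U+00A0), LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029)
fall outside it and therefore end the token.

```python
import re

# Hypothetical helper: every maximal run of \w characters is one token.
WORD_RE = re.compile(r"\w+")


def tokenize(text):
    """Return the runs of Unicode word characters found in text."""
    return WORD_RE.findall(text)


# Made-up sample glued together with the three characters from the thread.
sample = "caf\u00e9\u00a0search\u2028index\u2029table"
tokens = tokenize(sample)
print(tokens)
```

Tokenizing on "anything that is not a word character" avoids having to
enumerate every possible separator up front, at the cost of also
splitting on characters such as hyphens and apostrophes.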
On Mon, 2012-04-30 at 18:38 +0000, rchilliard at mun.ca wrote:
> Hi All,
>
>
>
> Over the last few days, we've been sorting out a few kinks with
> fulltext searching / index creation on our local EPrints repository,
> and I thought I'd pass along the notes in the hope that they might
> help others. We noticed the issues after running the query posted by
> Paolo Tealdi a few days back to look for malformed content in the
> eprint index table:
>
>
>
> select *,length(word) from eprint__rindex where length(word) > 35
>
>
>
> In our local results we noted a number of 'word' values corresponding
> to eprints with PDF documents, in which series of valid words were
> strung together with assorted Unicode characters interspersed.
>
>
>
> The offending Unicode characters were introduced in the export from
> PDF to text that EPrints runs to generate the fulltext to be indexed
> (called as '$(pdftotext) -enc UTF-8 -layout $(SOURCE) $(TARGET)').
> Owing to the '-layout' argument, many spaces, line endings and
> paragraph endings were converted to Unicode formatting characters not
> handled by the default tokenizer (e.g. space to 'NO-BREAK SPACE'
> "chr(0xa0)", line ending to 'LINE SEPARATOR' "\x{2028}" and paragraph
> ending to 'PARAGRAPH SEPARATOR' "\x{2029}").
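
A rough way to see which characters are gluing words together in an
over-long index entry is to list every character that is neither
alphanumeric nor an underscore, along with its Unicode name. This is a
sketch in Python, not part of EPrints; the helper name and the sample
value are invented for the illustration. Note that NO-BREAK SPACE is
code point U+00A0.

```python
import unicodedata


def describe_non_word_chars(word):
    """List (code point, Unicode name) for each character in word that is
    neither alphanumeric nor an underscore."""
    found = []
    for ch in word:
        if not (ch.isalnum() or ch == "_"):
            code_point = f"U+{ord(ch):04X}"
            found.append((code_point, unicodedata.name(ch, "UNNAMED")))
    return found


# Made-up stand-in for an over-long 'word' value from the index table.
bad_word = "valid\u00a0words\u2028strung\u2029together"
report = describe_non_word_chars(bad_word)
for code_point, name in report:
    print(code_point, name)
```

Running something like this over the values returned by the length
query gives a list of candidate characters to add to the delimiters.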
>
>
>
> These are easy to identify and add to the list of delimiters.
> However, the list of delimiters ('FREETEXT_SEPERATOR_CHARS') appears
> to be defined in both
> ~eprints/archives/{archiveid}/cfg/cfg.d/indexing.pl and
> ~eprints/perl_lib/EPrints/Index/Tokenizer.pm, and only the latter
> appears to have any effect. (The former may be orphaned code specific
> to our repository.)
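
The effect of extending a delimiter list can be sketched as follows, in
Python rather than Perl. The base delimiter string here is hypothetical
and is not the actual content of FREETEXT_SEPERATOR_CHARS; only the
three extra characters come from this thread. The helper names are
likewise made up.

```python
import re

# Hypothetical base delimiters; the real FREETEXT_SEPERATOR_CHARS differs.
BASE_SEPARATORS = " \t\r\n.,;:!?()[]\"'"
# The three characters reported in this thread.
EXTRA_SEPARATORS = "\u00a0\u2028\u2029"


def make_splitter(separators):
    """Compile a regex matching one or more of the given separators."""
    return re.compile("[" + re.escape(separators) + "]+")


def split_words(text, splitter):
    """Split text on the separators and drop empty pieces."""
    return [piece for piece in splitter.split(text) if piece]


text = "full\u00a0text\u2028search, index\u2029creation"
before = split_words(text, make_splitter(BASE_SEPARATORS))
after = split_words(text, make_splitter(BASE_SEPARATORS + EXTRA_SEPARATORS))
print(before)  # two long glued "words"
print(after)   # five separate words
```

With only the base list, the sample yields two over-long entries of the
kind the length query turns up; with the extended list it yields five
ordinary words.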
>
>
>
> Also of note: in our case, resetting the indexed values seemed to
> require reloading the config (restarting Apache and the indexer, so
> that the updated Tokenizer.pm is picked up), as well as emptying the
> eprint__rindex table, all before finally running epadmin
> erase_fulltext_index. For anyone whose search is misbehaving,
> hopefully this is of some help. Any warnings, criticisms or comments
> welcome!
>
>
>
> NB: as our config could differ significantly from others out there,
> it would be best to try the above on a non-critical / test repository
> first if it is of interest to you.
>
>
>
> Cheers,
>
> Casey
>
>
>
> Casey Hilliard
>
> PC Consultant,
>
> Health Sciences Library / QE2 Systems,
>
> Memorial University
>
> Phone: 709-777-2387 (HSL)
>
> Phone: 709-864-6267 (QE2)