[EP-tech] Re: Full text indexing document in Xapian search

Paolo Tealdi paolo.tealdi at polito.it
Fri Aug 31 11:10:48 BST 2012


On 08/30/2012 03:02 PM, Tim Brody wrote:
> On Thu, 2012-08-30 at 14:12 +0200, Paolo Tealdi wrote:
>> Dear all,
>>
>> i'm upgrading from 3.2.4 to 3.3.10 and evaluating the new features of
>> 3.3.10 version. I've installed Xapian search and i think that now simple
>> search is quicker than 3.2.4 one.
>> Nevertheless, i think that fulltext index is not present in Xapian
>> search. Am i right ?
>> How can i decide the fields list indexed in simple search (Xapian in my
>> case) ?
> Xapian should search all fields, including the documents, if EPrints can
> convert the document to plain text.
>
> The indexing code is in lib/cfg.d/search_xapian.pl.
>
> There isn't much help for you debugging what has gone wrong with
> indexing. Best I can suggest is adding this just above
> "replace_document_by_term":
>
> my $i = $doc->termlist_begin;
> print "$i, " while ++$i ne $doc->termlist_end;
> print "\n";
>
> Then:
>
> ./bin/epadmin reindex [archiveid] eprint [eprintid]
>
> For an eprint that isn't matching.
>
> Will show you exactly what's getting indexed for a given eprint.
Hi Tim,

thank you for your answer.
I debugged that file as you suggested. As you said, Xapian::Search indexes
all fields, including documents.
I noticed that Xapian search doesn't use the same word separators as the
normal indexing program: this means that the two indexes can potentially
contain many different words (this probably isn't a big problem for
English, but it is for Italian, for instance). Do you think it would be
possible to avoid this problem? I searched the Xapian documentation and
didn't find anything on word splitting ...
I partially resolved it with this (brutal) line:

$buffer =~ s/$EPrints::Index::FREETEXT_SEPERATOR_REGEXP/ /g;

placed just before the "index_text" line in lib/cfg.d/search_xapian.pl
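
To illustrate the effect of that substitution outside EPrints, here is a
minimal standalone sketch. The pattern $SEPARATOR_REGEXP and the helper
normalise_separators are hypothetical stand-ins: the real
$EPrints::Index::FREETEXT_SEPERATOR_REGEXP is defined by EPrints and may
match a different set of characters. The idea is only that every run of
separator characters becomes a single space, so a later tokenizer splits
words at the same places as the normal indexer would.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $EPrints::Index::FREETEXT_SEPERATOR_REGEXP.
# Assumption: apostrophes and common punctuation count as separators.
my $SEPARATOR_REGEXP = qr/[-'\x{2019}.,;:!?()"\s]+/;

# Replace each run of separator characters with one space, so that any
# downstream tokenizer only sees whitespace-delimited words.
sub normalise_separators {
    my ($buffer) = @_;
    $buffer =~ s/$SEPARATOR_REGEXP/ /g;
    return $buffer;
}

# Italian elisions are a typical case where tokenizers may disagree:
# one may keep "l'acqua" whole, another may split it at the apostrophe.
my $normalised = normalise_separators("l'acqua dell'arte");
print "$normalised\n";
```

With this stand-in pattern, the elided forms are split at the apostrophe
before the text reaches the indexer, which is what the one-line change
above aims for.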


Best regards,
Paolo Tealdi

-- 
Ing. Paolo Tealdi         Area IT - Politecnico Torino
Telefono/Phone : +39-011-0906714 , FAX : +39-011-0906799
Indirizzo/Address : C.so Duca degli Abruzzi,  24 - 10129 Torino - ITALY
Skype : tealdi.paolo
Please consider your environmental responsibility before printing this e-mail


