<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div style="padding-bottom: 10px; padding-top: 5px;">
<div style="padding:12px; border:1px solid #8D3970; background-color:#F7F9FA; color:#8D3970; font-size:14px; line-height:22px; font-family: Calibri, Arial, Helvetica, sans-serif;">
<strong>CAUTION:</strong> This e-mail originated outside the University of Southampton.
</div>
</div>
<div><font face="Default Sans Serif,Verdana,Arial,Helvetica,sans-serif" size="2">
<div>
<div>Hi Phil,</div>
</div>
<div><br>
</div>
<div>In the end, the inverted indexes of standard search engines are based on single terms. This is a basic principle.</div>
<div><br>
</div>
<div>Xapian is fairly basic in this respect. More advanced search engines such as Elasticsearch offer field types such as "keyword" that allow multi-term expressions to be stored; in the end, however, the Lucene backend also stores single terms in its inverted
indexes.</div>
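<div><br>
</div>
<div>As a rough illustration of the difference between single-word terms and whole-phrase terms, here is a minimal Python sketch of a toy inverted index. It is not Xapian, Lucene or EPrints code; the function names and the two sample records are made up for the example, and only the keywords themselves come from this thread.</div>
<pre>
```python
from collections import defaultdict


def build_index(docs, tokenize):
    """Map each term produced by `tokenize` to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def word_terms(text):
    """Single-word terms: split on commas and whitespace."""
    return text.lower().replace(",", " ").split()


def phrase_terms(text):
    """Whole-phrase terms: each comma-separated phrase is kept as one term."""
    return [p.strip().lower() for p in text.split(",") if p.strip()]


# Hypothetical records; the keywords are borrowed from the example in this thread.
docs = {
    1: "fire risk assessment, building safety",
    2: "fire safety, means of escape",
}

by_word = build_index(docs, word_terms)
by_phrase = build_index(docs, phrase_terms)

print(sorted(by_word["safety"]))         # records containing the word "safety"
print(sorted(by_phrase["fire safety"]))  # records holding the exact phrase
print(sorted(by_phrase["safety"]))       # "safety" alone is not a whole phrase
```
</pre>
<div>With whole-phrase terms a lookup only succeeds for the exact phrase, which is roughly the behaviour one would expect from a "keyword"-style field.</div>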
<div><br>
</div>
<div>Still, there remains the difficulty of identifying a multi-term expression within a body of text. This is usually the domain of natural language processing, and requires special tools and thesauri.</div>
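<div><br>
</div>
<div>To give an idea of the simplest thesaurus-based approach, the following Python sketch looks up known phrases in running text, preferring the longest match at each position. It is only an assumption of how such a tool might start out: the function name, the small vocabulary and the sample sentence are invented for illustration, and punctuation, stemming and spelling variants are not handled at all.</div>
<pre>
```python
def find_phrases(text, thesaurus, max_words=6):
    """Return known phrases found in `text`, preferring the longest match at each position."""
    known = {p.lower() for p in thesaurus}
    words = text.lower().split()
    found = []
    i = 0
    while i != len(words):
        match_len = 0
        # Try the longest window first so "fire risk assessment" beats "fire risk".
        for n in range(min(max_words, len(words) - i), 0, -1):
            if " ".join(words[i:i + n]) in known:
                match_len = n
                break
        if match_len:
            found.append(" ".join(words[i:i + match_len]))
            i += match_len  # skip past the matched phrase
        else:
            i += 1          # no phrase starts here, move on one word
    return found


# Hypothetical controlled vocabulary, built from keywords quoted in this thread.
thesaurus = ["fire risk assessment", "means of escape", "building safety", "fire risk"]
text = "The fire risk assessment covers means of escape and building safety"
print(find_phrases(text, thesaurus))
```
</pre>
<div>Real tools go well beyond this, but the sketch shows why a vocabulary of known phrases is needed before multi-term expressions can be recognised in free text.</div>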
<div><br>
</div>
<div>Kind regards,</div>
<div><br>
</div>
<div>Martin</div>
<div><br>
</div>
<br>
<div><font color="#990099">-----<<a href="mailto:eprints-tech-bounces@ecs.soton.ac.uk" target="_blank" rel="noopener noreferrer">eprints-tech-bounces@ecs.soton.ac.uk</a>> wrote: -----</font></div>
<div class="iNotesHistory" style="padding-left:5px;">
<div style="padding-right:0px;padding-left:5px;border-left:solid black 2px;">To: <<a href="mailto:eprints-tech@ecs.soton.ac.uk" target="_blank" rel="noopener noreferrer">eprints-tech@ecs.soton.ac.uk</a>>, "Phil Stacey" <<a href="mailto:phil@buildvoc.co.uk" target="_blank" rel="noopener noreferrer">phil@buildvoc.co.uk</a>><br>
From: "David R Newman via Eprints-tech" <eprints-tech@ecs.soton.ac.uk><br>
Sent by: <eprints-tech-bounces@ecs.soton.ac.uk><br>
Date: 25.01.2021 10:39<br>
Subject: Re: [EP-tech] Help indexing phrases<br>
<br>
<p>Hi Phil,</p>
<p>Unfortunately, I don't think this is possible. I think you would need to create a new field that is a multiple Id field and use that instead. You could probably write a script to map values from the uncontrolled keywords field into this new multiple Id field. However,
even with this new field, I am not sure how well Xapian would index these values as individual multi-word terms. Advanced search on this field should work as you require. In 3.4.2 I introduced the Idci MetaField, which is basically the same as the Id MetaField but
matches case-insensitively. This is useful for matching things like email addresses and usernames, where case does not usually make a functional difference.</p>
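<p>As a very rough sketch of what the mapping step in such a script might do, the Python below splits a comma-separated keywords string into one value per phrase and drops case-insensitive duplicates. It is written in Python for brevity and is not EPrints code: the function name is hypothetical, reading from and writing to the actual fields is left out, and the duplicate "Fire Risk Assessment" entry was added to the sample purely to show the case handling.</p>
<pre>
```python
def split_keywords(raw):
    """Split a comma-separated keywords string into a list of whole phrases."""
    phrases = []
    seen = set()
    for part in raw.split(","):
        phrase = " ".join(part.split())  # trim and collapse whitespace, including line breaks
        key = phrase.casefold()          # compare case-insensitively
        if phrase and key not in seen:
            seen.add(key)
            phrases.append(phrase)
    return phrases


# Sample input based on the keywords quoted in this thread (duplicate added for illustration).
raw = """evacuation lift, part b - fire safety, b5 access and facilities for the fire
service, fire risk assessment, Fire Risk Assessment, residual risk"""
values = split_keywords(raw)
print(values)
```
</pre>
<p>Each entry in the resulting list would then become one value of the new multiple field.</p>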
<p>I have been thinking about how best to implement a keywords field that is more effective across simple, advanced and faceted search, particularly for multi-word terms. I have yet to settle on a solution, as I need to better understand how Xapian indexing works
to see if it can be set up to allow EPrints to index multi-word terms effectively.</p>
<p>Regards</p>
<p>David Newman<br>
</p>
<div class="moz-cite-prefix">On 25/01/2021 07:06, Phil Stacey via Eprints-tech wrote:<br>
</div>
<blockquote type="cite" cite="mid:EMEW3|91aac0adf16db2858d5c9f9d8f0d5ecex0O77j14eprints-tech-bounces|ecs.soton.ac.uk|310E323A-FF3A-4E9E-8F0C-B09433F71BAB@buildvoc.co.uk">
<div><span>I am using the uncontrolled keywords field, which holds phrases separated by commas, and I would like to index each </span><span>whole phrase.</span><br>
<span></span><br>
<span>For example:</span><br>
<span>evacuation lift, part b - fire safety, b5 access and facilities for the fire</span><br>
<span>service, fire risk assessment, residual risk, building safety, b4 external</span><br>
<span>fire spread, means of escape, principal works, health & safety strategy</span><br>
<span><a href="https://eprints.buildvoc.co.uk/id/eprint/865/">https://eprints.buildvoc.co.uk/id/eprint/865/</a></span><br>
<span></span><br>
<span>Question: how do I configure Xapian or indexing.pl to index the whole phrase instead of the </span><span>individual terms, for example fire, safety, or building?</span><br>
<br>
<div dir="ltr">
<p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt;">Best Regards,</p>
<p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt;"><span>Phil Stacey</span></p>
<p class="MsoNormal" style="margin: 0cm 0cm 0.0001pt;"><a href="https://eprints.buildvoc.co.uk/"><font color="#000000">building regulations guidance for fire safety</font></a></p>
</div>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<div><font face="Courier New,Courier,monospace" size="2">*** Options: <a class="moz-txt-link-freetext" href="http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech">
http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech</a><br>
*** Archive: <a class="moz-txt-link-freetext" href="http://www.eprints.org/tech.php/">
http://www.eprints.org/tech.php/</a><br>
*** EPrints community wiki: <a class="moz-txt-link-freetext" href="http://wiki.eprints.org/">
http://wiki.eprints.org/</a></font></div>
</blockquote>
</eprints-tech-bounces@ecs.soton.ac.uk></eprints-tech@ecs.soton.ac.uk></div>
</div>
<div></div>
</font></div>
</body>
</html>