<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="Generator" content="Microsoft Word 14 (filtered medium)">
<style><!--
/* Font Definitions */
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
p
        {mso-style-priority:99;
        margin:0cm;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
span.EmailStyle18
        {mso-style-type:personal-reply;
        font-family:"Arial","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page WordSection1
        {size:612.0pt 792.0pt;
        margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
        {page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body bgcolor="white" lang="SV" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">Hi,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">Querying ”with the wrong keyboard” has always been an issue when non-english characters are involved.&nbsp; I agree that the modern way to simply drop
 accents and everything “non-ASCII-7” solves most problems as you otherwise needs to know how the letter sounds which is far from obvious.&nbsp;
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">Example, we translate the Swedish characters<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">“å” and “ä” both to “a”<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">“ö” to “o”<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">For combined characters/litagues like the Dansih &nbsp;“æ” we substitute with both characters, in this case ”ae” as we feels this approach is most intuitive.<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">We get a few false positives in the querying but this is very seldom an issue.&nbsp;
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span lang="EN-US" style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">I have simply changed the Tokenizer.pm (and have a copy in our re-installation routines to replace this whenever we setup a new machine or what-not).&nbsp;
</span><span style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">Don’t know of any other way.<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D">/Christer<o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-size:11.0pt;font-family:&quot;Arial&quot;,&quot;sans-serif&quot;;color:#1F497D"><o:p>&nbsp;</o:p></span></p>
<div>
<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0cm 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US" style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;">From:</span></b><span lang="EN-US" style="font-size:10.0pt;font-family:&quot;Tahoma&quot;,&quot;sans-serif&quot;"> eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk]
<b>On Behalf Of </b>Liam Green-Hughes<br>
<b>Sent:</b> den 24 oktober 2017 15:28<br>
<b>To:</b> eprints-tech@ecs.soton.ac.uk<br>
<b>Subject:</b> [EP-tech] Problem with searching for names starting with Ö<o:p></o:p></span></p>
</div>
</div>
<p class="MsoNormal"><o:p>&nbsp;</o:p></p>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">Hi everyone,<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">We've run into an issue with searching for names containing certain characters and how they are handled by the Tokenizer.pm (<a href="https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm">https://github.com/eprints/eprints/blob/3.3/perl_lib/EPrints/Index/Tokenizer.pm</a>)
 module. I notice in the FREETEXT_CHAR_MAPPING that characters are being substituted when indexing takes place or search terms are entered. Many of the substitutions make sense, but some others seem to be done on a phonetic basis? Strangely, this isn't an issue
 on the simple search form, but if a name is entered in the &quot;Creator&quot; field of the advanced search some strange things can happen.<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">For example (btw names have been changed!) if an author exists on the system with the surname &quot;Öl&quot;, results will not be returned if I search by &quot;Ol&quot; but they will be if I enter
 &quot;Öl&quot; or, more suprisingly, &quot;Oel&quot; (thanks to the substitution made).&nbsp;<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">I understand that in many languages letters such as these are considered to be entirely different characters, but when people search using an English language keyboard they tend
 to just drop the accents. This has led to a situation where results were not returned in an expected manner.&nbsp;<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">Has anyone else encountered this problem? I can change the behaviour by changing the mappings in Tokenizer.pm but that means modifying core code. It also doesn't look to be easily
 overridable?<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">Am very interested to hear any thoughts about how to approach this!<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">Thanks<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">Liam​<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">Library Systems Developer<o:p></o:p></span></p>
</div>
<div>
<p class="MsoNormal"><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black">University of Kent<o:p></o:p></span></p>
</div>
<p><span style="font-family:&quot;Calibri&quot;,&quot;sans-serif&quot;;color:black"><o:p>&nbsp;</o:p></span></p>
</div>
</body>
</html>