<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<p>Hi James,</p>
<p>Ah, so it looks like the error message is wrong rather than
necessarily the code. I should probably fix that and change it to
\\u{%04X}. Is your issue where the first _fail_hi is called or the
second in the snippet of code you provided (i.e. which one is line
178)?</p>
<p>Symplectic are responsible for the code in
eprints3/symplectic/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
so I would not want to hack around with it. This is why I think
both we and they are keen for people to move to RT2: having code
that sits on top of EPrints but is maintained by a third party is
not ideal, as the change management process can be a nightmare. You
could suggest your idea to Symplectic, but for such rare border
cases (which can be resolved manually) and with RT2 available, I
don't think they would be keen on making changes to the RT1 code
unless it is a very small change and they can be confident it
would not have any side effects.</p>
<p>Regards</p>
<p>David Newman<br>
</p>
<div class="moz-cite-prefix">On 17/02/2021 12:18, James Kerwin
wrote:<br>
</div>
<blockquote type="cite" cite="mid:CAKkNZ9DoSCBF7Uy9Ct2Rj-i+4BmW=D7epGD3K5tK2ViO+XfMLg@mail.gmail.com">
<div style="padding-bottom: 10px; padding-top: 5px;">
<div style="padding:12px; border:1px solid #8D3970;
background-color:#F7F9FA; color:#8D3970; font-size:14px;
line-height:22px; font-family: Calibri, Arial, Helvetica,
sans-serif;">
<strong>CAUTION:</strong> This e-mail originated outside the
University of Southampton.
</div>
</div>
<div>
<div dir="ltr">Hi David,
<div><br>
</div>
<div>Thank you for your reply. Unfortunately I don't have
access to the Elements database(s), but I've explained this
issue to our Elements people and hopefully should get a
response. Meanwhile, some time ago Mr Salter gave me the
means to extract the Elements XML and transform it via the
crosswalks outside of EPrints, so I may do that with the
different records and see what we get. Doing this has only
just occurred to me, so I'll give it a go.</div>
<div><br>
</div>
<div>On the subject of the character in question... The error
code comes from (I think!):</div>
<div><br>
</div>
<div>eprints3/perl_lib/URI/Escape.pm</div>
<div><br>
</div>
<div>Specifically here in the _fail_hi sub:</div>
<div><br>
</div>
<blockquote style="margin:0 0 0 40px;border:none;padding:0px">
<blockquote style="margin:0 0 0
40px;border:none;padding:0px">
<div>
<blockquote style="margin:0 0 0
40px;border:none;padding:0px">
<div>"sub uri_escape {<br>
my($text, $patn) = @_;<br>
return undef unless defined $text;<br>
if (defined $patn){<br>
unless (exists $subst{$patn}) {<br>
# Because we can't compile the regex we
fake it with a cached sub<br>
(my $tmp = $patn) =~ s,/,\\/,g;<br>
eval "\$subst{\$patn} = sub {\$_[0] =~
s/([$tmp])/\$escapes{\$1} || _fail_hi(\$1)/ge; }";<br>
Carp::croak("uri_escape: $@") if $@;<br>
}<br>
&{$subst{$patn}}($text);<br>
} else {<br>
$text =~ s/($Unsafe{RFC3986})/$escapes{$1}
|| _fail_hi($1)/ge;<br>
}<br>
$text;<br>
}<br>
<br>
sub _fail_hi {<br>
my $chr = shift;<br>
Carp::croak(sprintf "Can't escape \\x{%04X}, try
uri_escape_utf8() instead", ord($chr));<br>
}"<br>
</div>
<div><br>
</div>
</blockquote>
</div>
</blockquote>
</blockquote>
<div>The FULL error log line says:</div>
<div><br>
</div>
<blockquote style="margin:0 0 0 40px;border:none;padding:0px">
<blockquote style="margin:0 0 0
40px;border:none;padding:0px">
<blockquote style="margin:0 0 0
40px;border:none;padding:0px">
<div>Can't escape \\x{2019}, try uri_escape_utf8()
instead at /opt/eprints3/perl_lib/URI/Escape.pm line
178.\n\tURI::Escape::_fail_hi('\xe2\x80\x99') called
at /opt/eprints3/perl_lib/URI/Escape.pm line
171\n\tURI::Escape::uri_escape('Published by the
American Physical Society under the terms of...')
called at (eval 177) line
82\n\tEPrints::Config::uolrepo::__ANON__('dataset',
'EPrints::DataSet=HASH(0x7f21238f9358)', 'repository',
'Symplectic::Wrappers::EPrintsSession=HASH(0x7f2124610710)', 'dataobj',
'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
'changed', 'HASH(0x7f212d684f18)') called at
/opt/eprints3/perl_lib/EPrints/DataSet.pm line
1517\n\tEPrints::DataSet::run_trigger('EPrints::DataSet=HASH(0x7f21238f9358)',
105, 'dataobj',
'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
'changed', 'HASH(0x7f212d684f18)') called at
/opt/eprints3/perl_lib/EPrints/DataObj.pm line
669\n\tEPrints::DataObj::commit('EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
undef) called at
/opt/eprints3/perl_lib/EPrints/DataObj/EPrint.pm line
1011\n\tEPrints::DataObj::EPrint::commit('EPrints::DataObj::EPrint=HASH(0x7f21285879b0)')
called at
/opt/eprints3/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
line
355\n\tSymplectic::RepoProcess::MetadataManager::add_preferred_bibliographic('Symplectic::RepoProcess::MetadataManager=HASH(0x7f2123858468)',
'eprint',
'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
'raw_record',
'XML::LibXML::Document=SCALAR(0x7f212858bb60)',
'types', 'ARRAY(0x7f21254315a0)', 'limit_to',
'ARRAY(0x7f21215fceb8)', ...) called at
/opt/eprints3/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
line
240\n\tSymplectic::RepoProcess::MetadataManager::add_bibliographic('Symplectic::RepoProcess::MetadataManager=HASH(0x7f2123858468)',
'eprint',
'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
'publication',
'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)')
called at
/opt/eprints3/perl_lib/Symplectic/RepoProcess/IngestWorkflow.pm
line
203\n\tSymplectic::RepoProcess::IngestWorkflow::update_metadata('Symplectic::RepoProcess::IngestWorkflow=HASH(0x7f212858f348)',
'eprint',
'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
'publication',
'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)',
'auth_details',
'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)',
'record',
'Symplectic::RepoModel::PublicationsRecord=HASH(0x7f212c73f510)',
...) called at
/opt/eprints3/perl_lib/Symplectic/RepoProcess/PublicationManager.pm
line
65\n\tSymplectic::RepoProcess::PublicationManager::get_deposit_representation('Symplectic::RepoProcess::PublicationManager=HASH(0x7f212d7ac290)',
'publication',
'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)',
'auth_details',
'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)')
called at
/opt/eprints3/perl_lib/Symplectic/Process/FileDepositProcessor.pm
line
148\n\tSymplectic::Process::FileDepositProcessor::handle('Symplectic::Process::FileDepositProcessor=HASH(0x7f212d6d73b0)',
'pid', 485375, 'auth_details',
'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)',
'deposit_props',
'Symplectic::PubsModel::DepositProperties=HASH(0x7f212e8a0440)',
'atom', 'CGI::File::Temp=GLOB(0x7f212d7fae08)', ...)
called at
/opt/eprints3/perl_lib/Symplectic/Handlers/RepositoryHandler.pm
line
235\n\tSymplectic::Handlers::RepositoryHandler::post_handler('session',
'Symplectic::Wrappers::EPrintsSession=HASH(0x7f2124610710)', 'request',
'Apache2::RequestRec=SCALAR(0x7f212e8a77a8)',
'auth_details',
'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)')
called at
/opt/eprints3/perl_lib/Symplectic/Handlers/RepositoryHandler.pm
line
109\n\tSymplectic::Handlers::RepositoryHandler::handler_multi('Apache2::RequestRec=SCALAR(0x7f212e8a77a8)',
undef) called at
/opt/eprints3/perl_lib/Symplectic/Apache/Rewrite.pm
line
98\n\tSymplectic::Apache::Rewrite::__ANON__('Apache2::RequestRec=SCALAR(0x7f212e8a77a8)')
called at -e line 0\n\teval {...} called at -e line
0\n</div>
</blockquote>
</blockquote>
</blockquote>
<div><br>
</div>
<div>I'm making some big assumptions, but I THINK the
"\\x{%04X}" is saying "take 4 characters from the result of
ord($chr) and put them here". I'm possibly very wrong. I
think any solution for this needs to belong in the
Symplectic code on the repo server. I don't fancy altering
core EPrints code for the sake of this. I'll be in a whole
world of hell before I know it. Yesterday when tracing this
I ended up at:</div>
<div><br>
</div>
<div>eprints3/symplectic/perl_lib/Symplectic/RepoProcess/MetadataManager.pm</div>
<div><br>
</div>
<div>Reading through the code it appears to identify the
preferred record and start processing it. Perhaps this is a
good opportunity to intervene and either swap bad characters
for good ones or encode/decode "properly" (as if I know what
I'm talking about). Complicated slightly by not being able
to thoroughly test it. I suppose another option would be to
see what XSLT etc. can do with regard to this and so catch
the problem within the crosswalks.</div>
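<div><br>
</div>
<div>As a rough illustration of the "swap or encode" idea above,
here is a minimal sketch using the public URI::Escape functions.
The sample string is made up, and where such a call would
actually sit (a local config trigger or the connector code) is an
assumption rather than something confirmed in this thread.</div>
<pre>
```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_escape_utf8);

# Hypothetical sample value containing U+2019 (right single quotation mark).
my $text = "the publisher\x{2019}s terms";

# uri_escape() croaks on characters above 0xFF, so trap the failure.
my $plain = eval { uri_escape($text) };
print "uri_escape failed: $@" unless defined $plain;

# Option 1: uri_escape_utf8() encodes to UTF-8, then percent-escapes the bytes.
my $escaped = uri_escape_utf8($text);
print "$escaped\n";

# Option 2: swap the typographic quote for an ASCII apostrophe before escaping.
(my $ascii = $text) =~ tr/\x{2019}/'/;
print uri_escape($ascii), "\n";
```
</pre>
<div>Option 1 keeps the original character, while option 2
changes the stored text, so which is preferable depends on what
the escaped value is used for.</div>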
<div><br>
</div>
<div>If we verify the manual record in Elements it gets a
higher precedence than the Scopus record and so the problem
disappears.</div>
<div><br>
</div>
<div>Regarding the other problem with the file link I will
need to refamiliarise myself with it and I'll reply later.
Plus this email is already wordy enough as it is!</div>
<div><br>
</div>
<div>Thanks,</div>
<div>James</div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Feb 17, 2021 at
10:31 AM David R Newman &lt;<a href="mailto:drn@ecs.soton.ac.uk" moz-do-not-send="true">drn@ecs.soton.ac.uk</a>&gt;
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Hi James,</p>
<p>I think you would need to look at this field in the
Elements record in its database to see how it is stored
differently when there is an import compared to when there
is manual entry. As you said, I think the problem is in
part that text box entries get parsed and encoded before
going into the database but imports do not (or at the very
least the process between input and output to the Elements
database is different). It would be useful to know how they
look different in the Elements database, as this may help
make EPrints more resilient to unexpected encodings in
future. <br>
</p>
<p>However "\\x{2019}" looks like an escaped version of
something that is not particularly valid. If this was
"\\u{2019}" this would probably work better as \x I
think can only be used to represent a standard ASCII
character that can be only two hex digits like \x3a is a
colon ":". \u is used for the extended character set
(i.e. UTF-16). \u{2019} in UTF-8 would be \xE2\x80\x99.<br>
</p>
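<p>A small sketch, using only core Perl, of how the code point
U+2019 relates to the bytes \xE2\x80\x99 mentioned above. In Perl
itself the braced form \x{2019} denotes a code point rather than a
single byte, and %04X in sprintf formats a number as at least four
upper-case hex digits, padded with zeros. The variable names are
purely illustrative.</p>
<pre>
```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: RIGHT SINGLE QUOTATION MARK, code point U+2019.
my $chr = "\x{2019}";

# ord() gives the code point as a number; %04X prints it as hex digits.
my $codepoint = sprintf "%04X", ord($chr);

# Encoding the character as UTF-8 yields a three-byte string.
my $bytes = encode("UTF-8", $chr);
my $hex = join " ", map { sprintf "%02X", ord($_) } split //, $bytes;

print "U+$codepoint is $hex in UTF-8\n";   # U+2019 is E2 80 99 in UTF-8
```
</pre>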
<p>It would be interesting to get a bit more information
about your other issue with regular quote marks and
semi-colons, which are part of the standard ASCII set
rather than extended characters. These really should
not be causing a problem.</p>
<p>Regards</p>
<p>David Newman<br>
</p>
<div>On 17/02/2021 09:44, James Kerwin via Eprints-tech
wrote:<br>
</div>
<blockquote type="cite">
<div>
<div dir="ltr">Hi All,<br>
<div><br>
</div>
<div>This is an Elements/EPrints question. Apologies
that it isn't purely EPrints, but this is probably
the best place to get an answer. I want to know
whether others experience this or whether it's some
oddity of our setup.</div>
<div><br>
</div>
<div>We are using RT1 (for now) and EPrints 3.3.14
(also for now until upgrade). Occasionally we get
an Elements record that is from Scopus, PubMed
etc. that has an odd character in it that prevents
upload. When I look in the Apache logs it tells me
the problem. Yesterday's one was the presence of:<br>
<br>
"Unicode Character “’” (U+2019)" <br>
<br>
Which showed in the logs as:<br>
<br>
"Can't escape \\x{2019}, try uri_escape_utf8()
instead at /opt/eprints3/perl_lib/URI/Escape.pm"<br>
<br>
Importantly, if I copy the problem characters into
the manual Elements record, it doesn't pose a
problem. There appears to be some processing that properly
encodes characters entered via text box, but not
characters dragged into Elements from other sources.<br>
<br>
I've also had the issue of files containing
"'" or ";" etc. not being accessible via Elements
(a very similar, but different problem).<br>
<br>
I found where I COULD fix the former issue, but it
involves changing EPrints code when I SHOULD be
altering the Symplectic connector code on the repo
server.<br>
<br>
Anyway, I'm not specifically looking for a
solution, but has anybody else experienced
anything similar? If so, does it stop with RT2? I
hope to raise a ticket with Symplectic over this
eventually.</div>
<div><br>
</div>
<div>Thanks,</div>
<div>James<br>
<br>
<br>
</div>
</div>
</div>
<br>
<pre>*** Options: <a href="http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech" target="_blank" moz-do-not-send="true">http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech</a>
*** Archive: <a href="http://www.eprints.org/tech.php/" target="_blank" moz-do-not-send="true">http://www.eprints.org/tech.php/</a>
*** EPrints community wiki: <a href="http://wiki.eprints.org/" target="_blank" moz-do-not-send="true">http://wiki.eprints.org/</a></pre>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</body>
</html>