
[EP-tech] Elements-EPrints Odd Characters stopping upload



Hi James,

Ah, so it looks like the error message is wrong rather than necessarily 
the code. I should probably fix that and change it to \\u{%04X}. Is your 
issue where the first _fail_hi is called or the second in the snippet 
of code you provided (i.e. which one is line 178)?
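
For reference, here is a minimal sketch of what that format string produces and of the failure itself. It assumes the input is a decoded character string and that the URI and Encode modules are installed; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use URI::Escape qw(uri_escape);

# U+2019 RIGHT SINGLE QUOTATION MARK, the character named in the log.
my $chr = "\x{2019}";

# %04X prints the code point in uppercase hex, zero-padded to at least
# four digits. It pads short values and does not truncate long ones.
my $label = sprintf "\\x{%04X}", ord($chr);    # \x{2019}
my $short = sprintf "\\x{%04X}", ord(":");     # \x{003A}
my $long  = sprintf "\\x{%04X}", 0x1F600;      # \x{1F600}

# The same character as UTF-8 bytes, matching the '\xe2\x80\x99'
# argument shown for _fail_hi in the trace.
my $bytes = join " ", map { sprintf "%02X", ord($_) }
    split //, encode("UTF-8", $chr);           # E2 80 99

# uri_escape() only has escapes for code points 0..255, so it croaks here.
my $error = "";
eval { uri_escape($chr); 1 } or $error = $@;

print "$label $short $long\n$bytes\n$error";
```

So the number in the message is simply the code point of the offending character, whichever prefix is printed in front of it.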

Symplectic are responsible for the code in 
eprints3/symplectic/perl_lib/Symplectic/RepoProcess/MetadataManager.pm 
so I would not want to hack around with it. This is why I think both we 
and they are keen for people to move to RT2, as having code that sits on 
top of EPrints maintained by a third party is not ideal: the change 
management process can be a nightmare. You could suggest your idea to 
Symplectic, but for such rare border cases (which can be resolved 
manually) and with RT2 available, I don't think they would be keen on 
making changes to the RT1 code unless it is a very small change and they 
can be confident it would not have any side effects.
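
To give an idea of how small such a change could be, here is a rough sketch of the two options raised in the quoted message (swapping the characters, or encoding them properly). The helper name ascii_quotes and the sample string are hypothetical and not taken from EPrints or the Symplectic code, and the sketch assumes the text arrives as a decoded character string.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_escape_utf8);

# Hypothetical helper (not part of EPrints or the Symplectic connector):
# swap common typographic quotes for their ASCII equivalents.
sub ascii_quotes {
    my ($text) = @_;
    return undef unless defined $text;
    # U+2018/U+2019 become ', U+201C/U+201D become ".
    $text =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
    return $text;
}

# Assumed input: a decoded character string containing U+2019.
my $licence = "the publisher\x{2019}s terms";

# Option 1: normalise first, after which plain uri_escape() is safe.
my $normalised = uri_escape(ascii_quotes($licence));

# Option 2: keep the character and escape its UTF-8 bytes, as the
# error message itself suggests.
my $utf8 = uri_escape_utf8($licence);

print "$normalised\n$utf8\n";
# the%20publisher%27s%20terms
# the%20publisher%E2%80%99s%20terms
```

Option 1 changes the stored text, while option 2 preserves it, so which is appropriate would depend on where the escaped value ends up being used.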

Regards

David Newman

On 17/02/2021 12:18, James Kerwin wrote:
> *CAUTION:* This e-mail originated outside the University of Southampton.
> Hi David,
>
> Thank you for your reply. Unfortunately I don't have access to the 
> Elements database(s) but I've explained this issue to our Elements 
> people and hopefully should get a response. Meanwhile, some time ago 
> Mr Salter gave me the means to extract the Elements xml and transform 
> it via the crosswalks outside of EPrints, so I may do that with the 
> different records and see what we get. This has only just occurred 
> to me, so I'll give it a go.
>
> On the subject of the character in question... The error code comes 
> from (I think!):
>
> eprints3/perl_lib/URI/Escape.pm
>
> Specifically here in the _fail_hi sub:
>
>             "sub uri_escape {
>                 my($text, $patn) = @_;
>                 return undef unless defined $text;
>                 if (defined $patn){
>                     unless (exists $subst{$patn}) {
>                         # Because we can't compile the regex we fake it with a cached sub
>                         (my $tmp = $patn) =~ s,/,\\/,g;
>                         eval "\$subst{\$patn} = sub {\$_[0] =~ s/([$tmp])/\$escapes{\$1} || _fail_hi(\$1)/ge; }";
>                         Carp::croak("uri_escape: $@") if $@;
>                     }
>                     &{$subst{$patn}}($text);
>                 } else {
>                     $text =~ s/($Unsafe{RFC3986})/$escapes{$1} || _fail_hi($1)/ge;
>                 }
>                 $text;
>             }
>
>             sub _fail_hi {
>                 my $chr = shift;
>                 Carp::croak(sprintf "Can't escape \\x{%04X}, try uri_escape_utf8() instead", ord($chr));
>             }"
>
> The FULL error log line says:
>
>             Can't escape \\x{2019}, try uri_escape_utf8() instead at
>             /opt/eprints3/perl_lib/URI/Escape.pm line
>             178.\n\tURI::Escape::_fail_hi('\xe2\x80\x99') called at
>             /opt/eprints3/perl_lib/URI/Escape.pm line
>             171\n\tURI::Escape::uri_escape('Published by the American
>             Physical Society under the terms of...') called at (eval
>             177) line
>             82\n\tEPrints::Config::uolrepo::__ANON__('dataset',
>             'EPrints::DataSet=HASH(0x7f21238f9358)', 'repository',
>             'Symplectic::Wrappers::EPrintsSession=HASH(0x7f2124610710)',
>             'dataobj',
>             'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
>             'changed', 'HASH(0x7f212d684f18)') called at
>             /opt/eprints3/perl_lib/EPrints/DataSet.pm line
>             1517\n\tEPrints::DataSet::run_trigger('EPrints::DataSet=HASH(0x7f21238f9358)',
>             105, 'dataobj',
>             'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
>             'changed', 'HASH(0x7f212d684f18)') called at
>             /opt/eprints3/perl_lib/EPrints/DataObj.pm line
>             669\n\tEPrints::DataObj::commit('EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
>             undef) called at
>             /opt/eprints3/perl_lib/EPrints/DataObj/EPrint.pm line
>             1011\n\tEPrints::DataObj::EPrint::commit('EPrints::DataObj::EPrint=HASH(0x7f21285879b0)')
>             called at
>             /opt/eprints3/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
>             line
>             355\n\tSymplectic::RepoProcess::MetadataManager::add_preferred_bibliographic('Symplectic::RepoProcess::MetadataManager=HASH(0x7f2123858468)',
>             'eprint', 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
>             'raw_record',
>             'XML::LibXML::Document=SCALAR(0x7f212858bb60)', 'types',
>             'ARRAY(0x7f21254315a0)', 'limit_to',
>             'ARRAY(0x7f21215fceb8)', ...) called at
>             /opt/eprints3/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
>             line
>             240\n\tSymplectic::RepoProcess::MetadataManager::add_bibliographic('Symplectic::RepoProcess::MetadataManager=HASH(0x7f2123858468)',
>             'eprint', 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
>             'publication',
>             'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)')
>             called at
>             /opt/eprints3/perl_lib/Symplectic/RepoProcess/IngestWorkflow.pm
>             line
>             203\n\tSymplectic::RepoProcess::IngestWorkflow::update_metadata('Symplectic::RepoProcess::IngestWorkflow=HASH(0x7f212858f348)',
>             'eprint', 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
>             'publication',
>             'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)',
>             'auth_details',
>             'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)',
>             'record',
>             'Symplectic::RepoModel::PublicationsRecord=HASH(0x7f212c73f510)',
>             ...) called at
>             /opt/eprints3/perl_lib/Symplectic/RepoProcess/PublicationManager.pm
>             line
>             65\n\tSymplectic::RepoProcess::PublicationManager::get_deposit_representation('Symplectic::RepoProcess::PublicationManager=HASH(0x7f212d7ac290)',
>             'publication',
>             'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)',
>             'auth_details',
>             'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)')
>             called at
>             /opt/eprints3/perl_lib/Symplectic/Process/FileDepositProcessor.pm
>             line
>             148\n\tSymplectic::Process::FileDepositProcessor::handle('Symplectic::Process::FileDepositProcessor=HASH(0x7f212d6d73b0)',
>             'pid', 485375, 'auth_details',
>             'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)',
>             'deposit_props',
>             'Symplectic::PubsModel::DepositProperties=HASH(0x7f212e8a0440)',
>             'atom', 'CGI::File::Temp=GLOB(0x7f212d7fae08)', ...)
>             called at
>             /opt/eprints3/perl_lib/Symplectic/Handlers/RepositoryHandler.pm
>             line
>             235\n\tSymplectic::Handlers::RepositoryHandler::post_handler('session',
>             'Symplectic::Wrappers::EPrintsSession=HASH(0x7f2124610710)',
>             'request', 'Apache2::RequestRec=SCALAR(0x7f212e8a77a8)',
>             'auth_details',
>             'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)')
>             called at
>             /opt/eprints3/perl_lib/Symplectic/Handlers/RepositoryHandler.pm
>             line
>             109\n\tSymplectic::Handlers::RepositoryHandler::handler_multi('Apache2::RequestRec=SCALAR(0x7f212e8a77a8)',
>             undef) called at
>             /opt/eprints3/perl_lib/Symplectic/Apache/Rewrite.pm line
>             98\n\tSymplectic::Apache::Rewrite::__ANON__('Apache2::RequestRec=SCALAR(0x7f212e8a77a8)')
>             called at -e line 0\n\teval {...} called at -e line 0\n
>
>
> I'm making some big assumptions, but I THINK the "\\x{%04X}" is saying 
> "take 4 characters from the result of ord($chr) and put them here". 
> I'm possibly very wrong. I think any solution for this needs to belong 
> in the Symplectic code on the repo server. I don't fancy altering core 
> EPrints code for the sake of this. I'll be in a whole world of hell 
> before I know it. Yesterday when tracing this I ended up at:
>
> eprints3/symplectic/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
>
> Reading through the code it appears to identify the preferred record 
> and start processing it. Perhaps this is a good opportunity to 
> intervene and either swap bad characters for good ones or 
> encode/decode "properly" (as if I know what I'm talking about). 
> Complicated slightly by not being able to thoroughly test it. I 
> suppose another option would be to see what XSLT etc. can do with 
> regard to this and so catch the problem within the crosswalks.
>
> If we verify the manual record in Elements it gets a higher precedence 
> than the Scopus record and so the problem disappears.
>
> Regarding the other problem with the file link I will need to 
> refamiliarise myself with it and I'll reply later. Plus this email is 
> already wordy enough as it is!
>
> Thanks,
> James
>
>
>
> On Wed, Feb 17, 2021 at 10:31 AM David R Newman <drn at ecs.soton.ac.uk> 
> wrote:
>
>     Hi James,
>
>     I think you would need to look at this field in the Elements
>     record in its database to see how it is being stored differently
>     when there is an import compared to when there is manual entry.
>     As you said, I think the problem is in part that text box entries
>     get parsed and encoded before going into the database but imports
>     do not (or at the very least the process between input and output
>     to the Elements database is different). It would be useful to know
>     how they look different in the Elements database, as this may
>     help make EPrints more resilient to unexpected encodings in
>     future.
>
>     However "\\x{2019}" looks like an escaped version of something
>     that is not particularly valid. If this was "\\u{2019}" this
>     would probably work better as \x I think can only be used to
>     represent a standard ASCII character that can be only two hex
>     digits like \x3a is a colon ":". \u is used for the extended
>     character set (i.e. UTF-16). \u{2019} in UTF-8 would be \xE2\x80\x99.
>
>     It would be interesting to get a bit more information about your
>     other issue with regular quote marks and semi-colons that are part
>     of the standard ASCII set rather than extended characters.
>     These really should not be causing a problem.
>
>     Regards
>
>     David Newman
>
>     On 17/02/2021 09:44, James Kerwin via Eprints-tech wrote:
>>     Hi All,
>>
>>     This is an Elements/EPrints question. Apologies that it isn't
>>     purely EPrints, but this is probably the best place to get an
>>     answer. I want to know if others experience this or whether it's
>>     some oddity of our setup.
>>
>>     We are using RT1 (for now) and EPrints 3.3.14 (also for now until
>>     upgrade). Occasionally we get an Elements record that is from
>>     Scopus, PubMed etc. that has an odd character in it that prevents
>>     upload. When I look in the Apache logs it tells me the problem.
>>     Yesterday's one was the presence of:
>>
>>     "Unicode Character “’” (U+2019)"
>>
>>     Which showed in the logs as:
>>
>>     "Can't escape \\x{2019}, try uri_escape_utf8() instead at
>>     /opt/eprints3/perl_lib/URI/Escape.pm"
>>
>>     Importantly if I copy the problem characters to the manual
>>     Elements record it doesn't pose a problem. There appears to be
>>     some processing to properly encode characters entered via text
>>     box, but not characters dragged in from other sources into Elements.
>>
>>     I've also had the issue with files containing "'" or ";" etc.
>>     not being accessible via Elements (a very similar, but different
>>     problem).
>>
>>     I found where I COULD fix the former issue, but it involves
>>     changing EPrints code when I SHOULD be altering the Symplectic
>>     connector code on the repo server.
>>
>>     Anyway, I'm not specifically looking for a solution, but has
>>     anybody else experienced anything similar? If so, does it stop
>>     with RT2? I hope to raise a ticket with Symplectic over this
>>     eventually.
>>
>>     Thanks,
>>     James

