[EP-tech] Elements-EPrints Odd Characters stopping upload
- Subject: [EP-tech] Elements-EPrints Odd Characters stopping upload
- From: drn at ecs.soton.ac.uk (David R Newman)
- Date: Wed, 17 Feb 2021 12:44:47 +0000
Hi James,
Ah, so it looks like the error message is wrong rather than necessarily
the code. I should probably fix that and change it to \\u{%04X}. Is your
issue where the first _fail_hi is called or the second in the snippet
of code you provided (i.e. which one is line 178)?
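
As a side note on what the "%04X" in that message does, here is a minimal sketch in Python rather than Perl (the describe() helper is hypothetical and not part of URI::Escape). The format prints an integer as upper-case hexadecimal, zero-padded to at least four digits, so the code point of the character from the log comes out as 2019, while its UTF-8 encoding is the three bytes seen in the stack trace.

```python
# Python analogue for illustration only; the code quoted in this thread is Perl.

def describe(ch: str) -> dict:
    """Return the hex code point (as "%04X" would print it) and the UTF-8 bytes."""
    code_point = ord(ch)  # integer code point, e.g. 0x2019
    return {
        "hex_code_point": "%04X" % code_point,  # upper-case hex, padded to 4 digits
        "utf8_bytes": ch.encode("utf-8"),       # byte sequence used on the wire
    }

info = describe("\u2019")  # RIGHT SINGLE QUOTATION MARK
print(info["hex_code_point"])
print(info["utf8_bytes"])
```

The same call on an ASCII colon gives "003A", which shows the zero padding.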
Symplectic are responsible for the code in
eprints3/symplectic/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
so I would not want to hack around with it. This is why I think both we
and they are keen for people to move to RT2, as having code that sits on
top of EPrints maintained by a third party is not ideal: the change
management process can be a nightmare. You could suggest your idea to
Symplectic but for such rare border cases (which can be resolved
manually) and with RT2 available, I don't think they would be keen on
making changes to the RT1 code unless it is a very small change and they
can be confident it would not have any side effects.
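
For anyone weighing up the "encode properly" idea, a rough sketch of the general technique follows, in Python rather than Perl and not taken from the EPrints or Symplectic code. It assumes the approach that uri_escape_utf8() is documented to take: encode the text as UTF-8 first, then percent-escape each byte outside the RFC 3986 unreserved set. The function name percent_encode_utf8 is hypothetical.

```python
from urllib.parse import quote

# Bytes that RFC 3986 allows to appear unescaped ("unreserved" characters).
UNRESERVED = (
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"abcdefghijklmnopqrstuvwxyz"
    b"0123456789-._~"
)

def percent_encode_utf8(text: str) -> str:
    """Encode text as UTF-8, then percent-escape every byte that is not unreserved."""
    out = []
    for byte in text.encode("utf-8"):
        if byte in UNRESERVED:
            out.append(chr(byte))
        else:
            out.append("%%%02X" % byte)  # e.g. 0xE2 becomes "%E2"
    return "".join(out)

sample = "Society\u2019s terms"
print(percent_encode_utf8(sample))
# The standard library gives the same result when no characters are marked safe.
print(quote(sample, safe=""))
```

Escaping the encoded bytes rather than the code point is what lets a character above U+00FF through, since each UTF-8 byte fits in a single %XX escape.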
Regards
David Newman
On 17/02/2021 12:18, James Kerwin wrote:
> Hi David,
>
> Thank you for your reply. Unfortunately I don't have access to the
> Elements database(s) but I've explained this issue to our Elements
> people and hopefully should get a response. Meanwhile, some time ago
> Mr Salter gave me the means to extract the Elements xml and transform
> it via the crosswalks outside of EPrints, so I may do that with the
> different records and see what we get. Doing this has only just
> occurred to me so I'll give it a go.
>
> On the subject of the character in question... The error code comes
> from (I think!):
>
> eprints3/perl_lib/URI/Escape.pm
>
> Specifically here in the _fail_hi sub:
>
> sub uri_escape {
>     my($text, $patn) = @_;
>     return undef unless defined $text;
>     if (defined $patn){
>         unless (exists  $subst{$patn}) {
>             # Because we can't compile the regex we fake it with a cached sub
>             (my $tmp = $patn) =~ s,/,\\/,g;
>             eval "\$subst{\$patn} = sub {\$_[0] =~ s/([$tmp])/\$escapes{\$1} || _fail_hi(\$1)/ge; }";
>             Carp::croak("uri_escape: $@") if $@;
>         }
>         &{$subst{$patn}}($text);
>     } else {
>         $text =~ s/($Unsafe{RFC3986})/$escapes{$1} || _fail_hi($1)/ge;
>     }
>     $text;
> }
>
> sub _fail_hi {
>     my $chr = shift;
>     Carp::croak(sprintf "Can't escape \\x{%04X}, try uri_escape_utf8() instead", ord($chr));
> }
>
> The FULL error log line says:
>
> Can't escape \\x{2019}, try uri_escape_utf8() instead at
> /opt/eprints3/perl_lib/URI/Escape.pm line
> 178.\n\tURI::Escape::_fail_hi('\xe2\x80\x99') called at
> /opt/eprints3/perl_lib/URI/Escape.pm line
> 171\n\tURI::Escape::uri_escape('Published by the American
> Physical Society under the terms of...') called at (eval
> 177) line
> 82\n\tEPrints::Config::uolrepo::__ANON__('dataset',
> 'EPrints::DataSet=HASH(0x7f21238f9358)', 'repository',
> 'Symplectic::Wrappers::EPrintsSession=HASH(0x7f2124610710)',
> 'dataobj',
> 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
> 'changed', 'HASH(0x7f212d684f18)') called at
> /opt/eprints3/perl_lib/EPrints/DataSet.pm line
> 1517\n\tEPrints::DataSet::run_trigger('EPrints::DataSet=HASH(0x7f21238f9358)',
> 105, 'dataobj',
> 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
> 'changed', 'HASH(0x7f212d684f18)') called at
> /opt/eprints3/perl_lib/EPrints/DataObj.pm line
> 669\n\tEPrints::DataObj::commit('EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
> undef) called at
> /opt/eprints3/perl_lib/EPrints/DataObj/EPrint.pm line
> 1011\n\tEPrints::DataObj::EPrint::commit('EPrints::DataObj::EPrint=HASH(0x7f21285879b0)')
> called at
> /opt/eprints3/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
> line
> 355\n\tSymplectic::RepoProcess::MetadataManager::add_preferred_bibliographic('Symplectic::RepoProcess::MetadataManager=HASH(0x7f2123858468)',
> 'eprint', 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
> 'raw_record',
> 'XML::LibXML::Document=SCALAR(0x7f212858bb60)', 'types',
> 'ARRAY(0x7f21254315a0)', 'limit_to',
> 'ARRAY(0x7f21215fceb8)', ...) called at
> /opt/eprints3/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
> line
> 240\n\tSymplectic::RepoProcess::MetadataManager::add_bibliographic('Symplectic::RepoProcess::MetadataManager=HASH(0x7f2123858468)',
> 'eprint', 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
> 'publication',
> 'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)')
> called at
> /opt/eprints3/perl_lib/Symplectic/RepoProcess/IngestWorkflow.pm
> line
> 203\n\tSymplectic::RepoProcess::IngestWorkflow::update_metadata('Symplectic::RepoProcess::IngestWorkflow=HASH(0x7f212858f348)',
> 'eprint', 'EPrints::DataObj::EPrint=HASH(0x7f21285879b0)',
> 'publication',
> 'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)',
> 'auth_details',
> 'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)',
> 'record',
> 'Symplectic::RepoModel::PublicationsRecord=HASH(0x7f212c73f510)',
> ...) called at
> /opt/eprints3/perl_lib/Symplectic/RepoProcess/PublicationManager.pm
> line
> 65\n\tSymplectic::RepoProcess::PublicationManager::get_deposit_representation('Symplectic::RepoProcess::PublicationManager=HASH(0x7f212d7ac290)',
> 'publication',
> 'Symplectic::PubsModel::Publication=HASH(0x7f212d6b7fe8)',
> 'auth_details',
> 'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)')
> called at
> /opt/eprints3/perl_lib/Symplectic/Process/FileDepositProcessor.pm
> line
> 148\n\tSymplectic::Process::FileDepositProcessor::handle('Symplectic::Process::FileDepositProcessor=HASH(0x7f212d6d73b0)',
> 'pid', 485375, 'auth_details',
> 'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)',
> 'deposit_props',
> 'Symplectic::PubsModel::DepositProperties=HASH(0x7f212e8a0440)',
> 'atom', 'CGI::File::Temp=GLOB(0x7f212d7fae08)', ...)
> called at
> /opt/eprints3/perl_lib/Symplectic/Handlers/RepositoryHandler.pm
> line
> 235\n\tSymplectic::Handlers::RepositoryHandler::post_handler('session',
> 'Symplectic::Wrappers::EPrintsSession=HASH(0x7f2124610710)',
> 'request', 'Apache2::RequestRec=SCALAR(0x7f212e8a77a8)',
> 'auth_details',
> 'Symplectic::PubsModel::AuthDetails=HASH(0x7f212d785c38)')
> called at
> /opt/eprints3/perl_lib/Symplectic/Handlers/RepositoryHandler.pm
> line
> 109\n\tSymplectic::Handlers::RepositoryHandler::handler_multi('Apache2::RequestRec=SCALAR(0x7f212e8a77a8)',
> undef) called at
> /opt/eprints3/perl_lib/Symplectic/Apache/Rewrite.pm line
> 98\n\tSymplectic::Apache::Rewrite::__ANON__('Apache2::RequestRec=SCALAR(0x7f212e8a77a8)')
> called at -e line 0\n\teval {...} called at -e line 0\n
>
>
> I'm making some big assumptions, but I THINK the "\\x{%04X}" is saying
> "take 4 characters from the result of ord($chr) and put them here".
> I'm possibly very wrong. I think any solution for this needs to belong
> in the Symplectic code on the repo server. I don't fancy altering core
> EPrints code for the sake of this. I'll be in a whole world of hell
> before I know it. Yesterday when tracing this I ended up at:
>
> eprints3/symplectic/perl_lib/Symplectic/RepoProcess/MetadataManager.pm
>
> Reading through the code it appears to identify the preferred record
> and start processing it. Perhaps this is a good opportunity to
> intervene and either swap bad characters for good ones or
> encode/decode "properly" (as if I know what I'm talking about).
> Complicated slightly by not being able to thoroughly test it. I
> suppose another option would be to see what XSLT etc. can do with
> regard to this and so catch the problem within the crosswalks.
>
> If we verify the manual record in Elements it gets a higher precedence
> than the Scopus record and so the problem disappears.
>
> Regarding the other problem with the file link I will need to
> refamiliarise myself with it and I'll reply later. Plus this email is
> already wordy enough as it is!
>
> Thanks,
> James
>
>
>
> On Wed, Feb 17, 2021 at 10:31 AM David R Newman <drn at ecs.soton.ac.uk> wrote:
>
> Hi James,
>
> I think you would need to look at this field in the Elements
> record in its database to see how it is being stored differently
> when there is an import compared to when there is manual entry.
> As you said, I think the problem is in part that text box entries
> get parsed and encoded before going into the database but imports
> do not (or at the very least the process between input and output to
> the Elements database is different). It would be useful to know
> how they look different in the Elements database as that may
> assist in making EPrints more resilient to unexpected encodings in
> future.
>
> However "\\x{2019}" looks like an escaped version of something
> that is not particularly valid. If this was "\\u{2019}" this
> would probably work better as \x I think can only be used to
> represent a standard ASCII character that can be only two hex
> digits like \x3a is a colon ":". \u is used for the extended
> character set (i.e. UTF-16). \u{2019} in UTF-8 would be \xE2\x80\x99.
>
> It would be interesting to get a bit more information about your
> other issue with regular quote marks and semi-colons that are part
> of the standard ASCII set rather than extended characters.
> These really should not be causing a problem.
>
> Regards
>
> David Newman
>
> On 17/02/2021 09:44, James Kerwin via Eprints-tech wrote:
>> Hi All,
>>
>> This is an Elements/EPrints question. Apologies that it isn't
>> purely EPrints, but this is probably the best place to get an
>> answer. I want to know whether others experience this or whether
>> it's some oddity of our setup.
>>
>> We are using RT1 (for now) and EPrints 3.3.14 (also for now until
>> upgrade). Occasionally we get an Elements record that is from
>> Scopus, PubMed etc. that has an odd character in it that prevents
>> upload. When I look in the Apache logs it tells me the problem.
>> Yesterday's one was the presence of:
>>
>> "Unicode Character '’' (U+2019)"
>>
>> Which showed in the logs as:
>>
>> "Can't escape \\x{2019}, try uri_escape_utf8() instead at
>> /opt/eprints3/perl_lib/URI/Escape.pm"
>>
>> Importantly if I copy the problem characters to the manual
>> elements record it doesn't pose a problem. There appears to be some
>> processing to properly encode characters entered via text box,
>> but not characters dragged in from other sources into Elements.
>>
>> I've also had the issue with files containing "'" or ";" etc.
>> not being accessible via Elements (a very similar, but different
>> problem).
>>
>> I found where I COULD fix the former issue, but it involves
>> changing EPrints code when I SHOULD be altering the Symplectic
>> connector code on the repo server.
>>
>> Anyway, I'm not specifically looking for a solution, but has
>> anybody else experienced anything similar? If so, does it stop
>> with RT2? I hope to raise a ticket with Symplectic over this
>> eventually.
>>
>> Thanks,
>> James
>>
>>
>>