[EP-tech] Elements-EPrints Odd Characters stopping upload



Hi James,

I think you would need to look at this field in the Elements record in 
its database to see how it is stored differently after an import 
compared with manual entry. As you said, I think the problem is in part 
that text box entries get parsed and encoded before going into the 
database but imports do not (or at the very least the process between 
input and output to the Elements database is different). It would be 
useful to know how they look different in the Elements database, as 
that may assist in making EPrints more resilient to unexpected 
encodings in future.

However "\\x{2019}" looks like Perl's escaped notation for a character 
outside the single-byte range. A plain \x escape with two hex digits 
can only represent one byte, e.g. \x3a is a colon ":", whereas the 
braced form \x{2019} names the Unicode code point U+2019 (right single 
quotation mark); other languages write the same code point as \u2019. 
The error suggests, I think, that uri_escape() was handed a decoded 
character string rather than UTF-8 bytes, which is why it points to 
uri_escape_utf8(). U+2019 encoded as UTF-8 would be \xE2\x80\x99.
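
To make the byte arithmetic concrete, here is a small Python sketch. 
Python is used only for illustration (EPrints and URI::Escape are Perl), 
and none of this is EPrints or Symplectic connector code. It shows that 
U+2019 is one code point but three UTF-8 bytes, that percent-escaping 
works on those bytes, and that the character has no single-byte form, 
which is roughly the situation a byte-oriented escaper complains about.

```python
import urllib.parse

# RIGHT SINGLE QUOTATION MARK, the character reported in the Apache log.
ch = "\u2019"

# One code point, but three bytes once encoded as UTF-8.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)

# Percent-escaping operates on the UTF-8 bytes (quote() encodes as
# UTF-8 by default).
escaped = urllib.parse.quote(ch, safe="")
print(escaped)

# A single-byte encoding has no slot for U+2019, so an escaper that
# skips the UTF-8 step has nothing valid to emit for it.
try:
    ch.encode("latin-1")
    fits_in_one_byte = True
except UnicodeEncodeError:
    fits_in_one_byte = False
print(fits_in_one_byte)
```

The same reasoning would apply on the Perl side: encoding the string to 
UTF-8 before escaping, or using an escaper that does so itself, avoids 
the wide-character failure.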

It would be interesting to get a bit more information about your other 
issue with regular quote marks and semi-colons, which are part of the 
standard ASCII set rather than extended characters. These really 
should not be causing a problem.

Regards

David Newman

On 17/02/2021 09:44, James Kerwin via Eprints-tech wrote:
> Hi All,
>
> This is an Elements/EPrints question. Apologies that it isn't purely 
> EPrints, but this is probably the best place to get an answer. I want 
> to know if others experience this or it's some oddity to our setup.
>
> We are using RT1 (for now) and EPrints 3.3.14 (also for now until 
> upgrade). Occasionally we get an Elements record that is from Scopus, 
> PubMed etc. that has an odd character in it that prevents upload. When 
> I look in the Apache logs it tells me the problem. Yesterday's one was 
> the presence of:
>
> "Unicode Character '’' (U+2019)"
>
> Which showed in the logs as:
>
> "Can't escape \\x{2019}, try uri_escape_utf8() instead at 
> /opt/eprints3/perl_lib/URI/Escape.pm"
>
> Importantly, if I copy the problem characters into a manual Elements 
> record it doesn't pose a problem. There appears to be some processing 
> to properly encode characters entered via text box, but not characters 
> dragged in from other sources into Elements.
>
> I've also had the issue with files containing "'" or ";" etc. not 
> being accessible via Elements (a very similar, but different problem).
>
> I found where I COULD fix the former issue, but it involves changing 
> EPrints code when I SHOULD be altering the Symplectic connector code 
> on the repo server.
>
> Anyway, I'm not specifically looking for a solution, but has anybody 
> else experienced anything similar? If so, does it stop with RT2? I 
> hope to raise a ticket with Symplectic over this eventually.
>
> Thanks,
> James