
[EP-tech] Re: DSpace import plugin doesn't seem to parse UTF-8 correctly



Hi Tim,

You've been very clear, thanks!

On 19/06/2015 01:17, Timothy Miles-Board wrote:
> Hi George,
>
> Assuming you're using something like:
>
> bin/import foo archive DSpace dspace.url --scripted --force --dump
>
> 1) $r->content;
>
> => returns raw bytes
>
> Looks OK in the console: dump() gets bytes and prints bytes (and I think your terminal decodes them as UTF-8 by default).
>
> Incorrect in the record because EPrints writes the bytes to utf8 database tables without decoding them.
>
> 2) $r->decoded_content;
>
> => returns the content with any Content-Encoding undone and the character set decoded, as a character string in Perl's internal form
>
> The output of --dump looks a bit weird in the console: that's Data::Dumper's representation of characters in internal strings (I think).
>
> If you add "print STDERR $epdata->{suggestions};" instead of using --dump, the output looks OK in the console but (as expected) you get a wide character warning, so also add binmode( STDERR, ":utf8" );
>
> Record also OK because writing internal strings to utf8 database tables is fine.
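
To check that I have understood 1) versus 2), here is a minimal sketch of the difference. It is not the plugin's code: the byte literal only stands in for what $r->content might return, and I am assuming the DSpace response is UTF-8. The last step re-encodes the undecoded bytes, which is my guess at how the record ended up garbled.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes for the Greek word with code points U+03B1 U+03C0 U+03CC.
# Hypothetical stand-in for the value returned by $r->content.
my $bytes = "\xCE\xB1\xCF\x80\xCF\x8C";

# Decoding turns the six bytes into three characters in Perl's internal
# form, which is roughly what $r->decoded_content does for a UTF-8 page.
my $chars = decode('UTF-8', $bytes);

printf "bytes: length %d\n", length $bytes;
printf "chars: length %d\n", length $chars;

# If the undecoded bytes are treated as characters and encoded again,
# each byte becomes a separate two-byte sequence (double encoding).
my $double = encode('UTF-8', $bytes);
printf "double-encoded: length %d\n", length $double;
```

The byte string has length 6, the decoded string length 3, and the double-encoded one length 12, so the patched call hands the importer the three characters rather than six unrelated ones.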
>
> Hope that helps explain what you are seeing :-)
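
The two console effects can also be reproduced outside EPrints. This is only a sketch under my own assumptions: $chars is a hand-written stand-in for a decoded field value, not something taken from the importer.

```perl
use strict;
use warnings;
use Data::Dumper;

# A character string in Perl's internal form (hypothetical sample value).
my $chars = "\x{3b1}\x{3c0}\x{3cc}";

# Data::Dumper escapes characters above 0xFF as \x{...}, which is why the
# dumped output looks odd even though the data itself is fine.
local $Data::Dumper::Terse = 1;
my $dumped = Dumper($chars);
print STDERR $dumped;

# Printing the string directly needs an encoding layer on the handle;
# without it Perl warns about a "Wide character in print".
binmode( STDERR, ":utf8" );
print STDERR $chars, "\n";
```

The dumped line contains the same \x{3b1}\x{3c0}\x{3cc} escapes I saw after applying the patch, while the second print shows the Greek text on a UTF-8 terminal.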
>
> Regards,
>
> Tim
>
> Timothy Miles-Board
> Web & Repositories Development Specialist, University of London Computer Centre
> 020 7863 1342  |  07742 970 351  | timothy.miles-board at london.ac.uk | @drtjmb
> The University of London is an exempt charity in England and Wales
>
> ________________________________________
> From: eprints-tech-bounces at ecs.soton.ac.uk <eprints-tech-bounces at ecs.soton.ac.uk> on behalf of George Mamalakis <mamalos at eng.auth.gr>
> Sent: 18 June 2015 3:09 PM
> To: eprints-tech at ecs.soton.ac.uk
> Subject: [EP-tech] Re: DSpace import plugin doesn't seem to parse UTF-8 correctly
>
> Tim,
>
> It did work! What is funny, though, is that if I run the import
> statement from the console by giving --debug to see the output, without
> your patch, the correct content is printed in the console (which
> displays utf8), whereas when I run it with your patch, it prints the
> escaped UTF-8 characters (e.g. \x{3b1}\x{3c0}\x{3cc}).
>
> Nonetheless, the point is that it worked, so thank you VERY much!
>
> I filed an issue, as you suggested
> (https://github.com/eprints/eprints/issues/326).
>
> Thanks again for the prompt help!
>
> On 18/06/2015 03:12, Timothy Miles-Board wrote:
>> Hi George,
>>
>> Try this patch:
>>
>> --- a/perl_lib/EPrints/Plugin/Import/DSpace.pm
>> +++ b/perl_lib/EPrints/Plugin/Import/DSpace.pm
>> @@ -235,7 +235,7 @@ sub retrieve_dcq
>>                   return undef;
>>           }
>>
>> -       my $dc = $self->find_dc_pairs( $r->content );
>> +       my $dc = $self->find_dc_pairs( $r->decoded_content );
>>           return undef unless defined $dc;
>>
>>           $self->{errurl} = $self->{errmsg} = undef;
>>
>> If it works, please pay it forward by submitting a bug report at http://github.com/eprints/eprints and adding the patch to the ticket as a proposed fix.
>>
>> Thanks,
>>
>> Tim
>>
>>
>> ________________________________________
>> From: eprints-tech-bounces at ecs.soton.ac.uk <eprints-tech-bounces at ecs.soton.ac.uk> on behalf of George Mamalakis <mamalos at eng.auth.gr>
>> Sent: 18 June 2015 12:16 PM
>> To: eprints-tech at ecs.soton.ac.uk
>> Subject: [EP-tech] DSpace import plugin doesn't seem to parse UTF-8 correctly
>>
>> Hello everybody,
>>
>> I am trying to use the DSpace Import plugin from EPrints by giving a URL
>> to the web interface. While the system imports the record, the characters
>> do not seem to be imported correctly. My import is performed from a
>> Greek DSpace server, and a test record that should be imported is this:
>> https://dspace.lib.uom.gr/handle/2159/323.
>>
>> If you enter the above URL into the DSpace import form, you'll see that
>> the record is imported but the character set is messed up.
>>
>> Thanks all for your help in advance,
>>
>> George.
>>
>> --
>> George Mamalakis
>>
>> IT and Security Officer,
>> Electrical and Computer Engineer (Aristotle Univ. of Thessaloniki),
>> PhD (Aristotle Univ. of Thessaloniki),
>> MSc (Imperial College of London)
>>
>> School of Electrical and Computer Engineering
>> Aristotle University of Thessaloniki
>>
>> phone number : +30 (2310) 994379
>>
>>
>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>> *** Archive: http://www.eprints.org/tech.php/
>> *** EPrints community wiki: http://wiki.eprints.org/
>> *** EPrints developers Forum: http://forum.eprints.org/
>
>
>
>
>


-- 
George Mamalakis

IT and Security Officer,
Electrical and Computer Engineer (Aristotle Univ. of Thessaloniki),
PhD (Aristotle Univ. of Thessaloniki),
MSc (Imperial College of London)

School of Electrical and Computer Engineering
Aristotle University of Thessaloniki

phone number : +30 (2310) 994379