EPrints Technical Mailing List Archive

Message: #08877



Re: [EP-tech] apostrophe in file names of uploaded/deposited file


CAUTION: This e-mail originated outside the University of Southampton.
Hi everyone,

This might be useful for others. I solved the issue with a couple of regexes:
    $filename =~ s/\x27/=0027/g;
    $filename =~ s/\x22/=0022/g;
which replace the apostrophe and double quote in the value returned by this call:
     $file->get_value("filename")
From a digital preservation perspective, I think it is significant to note that "filename" in this object does not necessarily refer to the filename on disk.
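
The two substitutions above can be wrapped in a small helper for reuse. This is only a minimal sketch: escape_quotes is a hypothetical name, not an EPrints function, and it covers only the two characters mentioned here. Whether EPrints encodes other characters on disk is not established in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of EPrints): rewrite the two quote
# characters in the =XXXX form used by the substitutions above.
sub escape_quotes
{
    my ( $filename ) = @_;

    $filename =~ s/\x27/=0027/g;    # apostrophe (U+0027)
    $filename =~ s/\x22/=0022/g;    # double quote (U+0022)

    return $filename;
}

# Stand-in for the value returned by $file->get_value("filename")
my $stored = "Services_techniques_a_l'Universite_Concordia.pdf";
print escape_quotes( $stored ), "\n";
```

Names without either character pass through unchanged, so the helper can be applied to every stored filename before looking the file up on disk.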

Is there a function or property in EPrints objects that returns the filename of the file exactly as it is on the filesystem?


Tomasz


From: David R Newman <drn@ecs.soton.ac.uk>
Sent: Sunday, February 20, 2022 4:28 PM
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>; Tomasz Neugebauer <Tomasz.Neugebauer@concordia.ca>
Subject: Re: [EP-tech] apostrophe in file names of uploaded/deposited file
 




Hi Tomasz,

There are two ways to work round this issue.  One has been in EPrints for quite a while; the other I introduced in 3.4.3 to help deal with this issue retrospectively.

1. https://wiki.eprints.org/w/Optional_filename_sanitise.pl allows you to set characters that should be removed before a filename is recorded in the database or saved to disk.  I have to admit I did not know about this until fairly recently, so I have not tested how well it works or whether it will solve your problem.  If you look at /opt/eprints3/lib/cfg.d/optional_filename_sanitise.pl there is a function that can be added under $c->{optional_filename_sanitise}.  The default (albeit commented out) function converts white space, brackets and @ signs into underscores.  You could add a line like the one below to deal with apostrophes.

$filepath =~ s!\x27!_!g;

2. The new functionality I added for 3.4.3 allows files on disk to be found under the filename <fileid>.bin.  This lets you fix this sort of issue by renaming the file on disk to <fileid>.bin.  You can also enable it so that future files are automatically saved in the format <fileid>.bin by setting:

$c->{generic_filenames} = 1;

I would probably advise against doing this on a live repository, especially if you have unusual uploads like uploading multiple files at once through "Upload from URL".  If you want to test this on a development repo, then please do, as any real-world-ish feedback on this feature would be useful.
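
For option 1, a sanitise function with the apostrophe line added might look roughly like the sketch below. The signature (repository object and file path in, sanitised path out) and the exact substitutions are assumptions based on the description above, not a copy of the shipped file, so check optional_filename_sanitise.pl in your own install before using it. The "my $c = {};" line is only a stand-in so the sketch runs by itself; in a real cfg.d file $c already exists.

```perl
use strict;
use warnings;

# Stand-in for the EPrints configuration hash (assumption for this sketch).
my $c = {};

# Assumed signature: ( $repo, $filepath ) in, sanitised path out.
$c->{optional_filename_sanitise} = sub
{
    my ( $repo, $filepath ) = @_;

    $filepath =~ s!\s!_!g;          # white space
    $filepath =~ s![()\[\]]!_!g;    # round and square brackets
    $filepath =~ s!\@!_!g;          # @ signs
    $filepath =~ s!\x27!_!g;        # apostrophes (the added line)

    return $filepath;
};

# Try it on a made-up filename; $repo is unused here, so undef is passed.
print $c->{optional_filename_sanitise}->( undef, "l'Universite (v2) \@home.pdf" ), "\n";
```

Note that this only affects files uploaded after the setting is enabled; files already on disk keep their existing names.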

Regards

David Newman

On 20/02/2022 20:32, Tomasz Neugebauer via Eprints-tech wrote:

Good afternoon!

 

I’m trying to troubleshoot an issue with exporting out a deposited file that has an apostrophe in the filename.

This is the issue: https://github.com/eprintsug/EPrintsArchivematica/issues/40

 

Does EPrints replace apostrophes in filenames on disk with =0027?

Where in the code does that happen?

The URL of the file has the apostrophe, for example:

https://spectrum.library.concordia.ca/id/eprint/7066/1/Services_techniques_a_l'Universite_Concordia.pdf!

But unlike other Unicode characters, the apostrophe doesn’t make it into the file name on disk, and is substituted with =0027.

I’m looking for confirmation that this is how it is “supposed” to work, and for an understanding where this happens in the code, so that I might ultimately know how many OTHER characters are replaced in this way in the filename?

 

Tomasz

 


*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/