EPrints Technical Mailing List Archive

Message: #08438



Re: [EP-tech] Word Documents won't download


CAUTION: This e-mail originated outside the University of Southampton.
Right, so where I'm at: I can access embargoed and non-embargoed .doc AND .docx files through both Elements and the repository. I still can't access that one particular item through Elements. This is far more progress than I could have hoped for when this was pointed out to me. The file URLs start with https on both systems, which is good.

I'm hoping there's some sort of cache issue within Elements. It is definitely the case that I couldn't access any word documents on either the repo or Elements and I now can, apart from the original problem item.

I shall call this a success! Thank you all again for the advice, pointers and patience. I'll uncomment those lines from 10_core.pl and make the advised changes to 20_baseurls.pl. Hopefully we finally get to upgrade the server and EPrints this year before we fall too far behind.

Thanks,
James

On Tue, Jan 5, 2021 at 1:01 PM David R Newman <drn@ecs.soton.ac.uk> wrote:

Hi James,

Something being under embargo is likely to redirect you to HTTPS as you need to login or be logged in to access the document.  So that probably is why you can download the .docx and not the .doc.

Regards

David Newman

On 05/01/2021 12:47, James Kerwin via Eprints-tech wrote:
Okay so a small update. Sorry for bombarding the list with this, but it's plausible that others may discover they have the same nasty problem and my step-by-step blundering through the repo config might be useful for a change.

The "test" word document doesn't download from Elements (the eprint record I included above somewhere) but another one now does! So I THINK I'm almost there with this issue.

The only difference between the two is that one is .doc (won't download) and the other is .docx (will download), which I don't think should be the cause of it. The .docx is under embargo while the other is not, which is going to be my next thing to investigate.

On Tue, Jan 5, 2021 at 12:14 PM James Kerwin via Eprints-tech <eprints-tech@ecs.soton.ac.uk> wrote:
Hi John,

I am glad you sent this because I don't have a 20_baseurls.pl in my archive config, but after you mentioned it I've just discovered it exists in the default cfg. Pretty sure mine is the old version as it checks for host first and if that isn't set looks for securehost.

In an act of desperate experimentation, before discovering this baseurls file, I commented out:

#$c->{host} = 'livrepository.liverpool.ac.uk';
#$c->{port} = 80;

This DOES give me an https URL in the Elements file link where it previously gave me http. I assumed this would fix it, but it hasn't! I click and see this text in my browser:


I'm at a bit of a loss now that this hasn't fixed it. This time I have no console clues to go on either. I've checked the RT1 settings in Elements and (surprisingly) wherever they mention the repo URL it's already https.

I shall keep thinking thoughts and hopefully a solution will magically present itself.

Thanks,
James

On Tue, Jan 5, 2021 at 11:31 AM John Salter <J.Salter@leeds.ac.uk> wrote:

Hi James,

For the 'doesn't resolve the access via Elements' aspect - is the URL in Elements http (rather than https)?

 

The data that Elements collects for a publication comes from e.g. https://livrepository.liverpool.ac.uk/rt4eprints/publication/[SYMPLECTIC PID]

That file is compiled via Symplectic::Atom::AtomSerialiser - and (via a few layers) ends up calling $document->get_url().

 

If the links in the Symp Atom representation are currently http, then I think you'll need to make EPrints create https URLs by default.

You might want protocol-less URLs in the user interface, but I'm not sure whether they would confuse any processing at the Symplectic end of things - so you probably need to make your base URL default to https.

Somewhere between 10_core.pl and 20_baseurls.pl you can make it default to https.

 

It might be worth comparing your live version of 20_baseurls.pl to this one: https://github.com/eprints/eprints/blob/3.3/lib/defaultcfg/cfg.d/20_baseurls.pl

It was updated to prefer https a few years ago, but as the file gets copied into your repo config it may well be that you've got an older version that prefers http.
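
As a rough illustration of what "preferring https" could mean, a cfg.d fragment along these lines would pick the secure host first. This is a hypothetical sketch, not the contents of the upstream 20_baseurls.pl: it assumes $c->{securehost} and $c->{host} hold bare hostnames, it ignores ports and path prefixes, and $preferred_base is an illustrative variable rather than an EPrints setting.

```perl
# Hypothetical sketch; not the upstream 20_baseurls.pl.
# $preferred_base is an illustrative local variable, not an EPrints setting.
my $preferred_base;
if( defined $c->{securehost} && length $c->{securehost} )
{
	# A secure host is configured, so build an https URL first.
	$preferred_base = 'https://' . $c->{securehost};
}
elsif( defined $c->{host} && length $c->{host} )
{
	# Fall back to plain http only when no secure host is set.
	$preferred_base = 'http://' . $c->{host};
}
```

An older copy of the file that checks host before securehost would effectively have these two branches the other way round, which is why it ends up generating http links.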

 

Cheers,

John

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of James Kerwin via Eprints-tech
Sent: 05 January 2021 09:32
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Word Documents won't download

 

Apologies, the link with the solution for the front-end is this one:

 

 

The stack overflow one is related (I currently have so many tabs open, hence the confusion).

 

On Tue, Jan 5, 2021 at 9:12 AM James Kerwin via Eprints-tech <eprints-tech@ecs.soton.ac.uk> wrote:

Morning all,

 

Don, John and David; thank you so much for setting me on the right path. I had zero chance of sorting this out without your help. Especially since I forgot that the console existed in Chrome...

 

I read this blog post that explains my repo-woes:

 

 

I found this "solution" that appears to work on the repository side of things:

 

 

I've so far opted for the addition of the meta tag in ../cfg/templates/default.xml:

 

<meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests" />

 

I appreciate this isn't the BEST solution, but it's the most immediate one that gets some people off my back until I can implement a proper one. It does not solve the issue of accessing the file via Elements, which is what I expected.

 

It was one of my predecessors or CSD that set up HTTPS in our repository and it hasn't been done in the standard "EPrints way". I thought it worked well by redirecting people to HTTPS. I had previously tried replicating this set-up in our data repository, but I'm glad I didn't now. The data repo doesn't go to https until logging in and it often generates emails from users along the lines of "your website isn't secure!".

 

Anyway, thank you once more for the help and advice. I will update you all on how I get on.

 

Thanks,

James

 

On Mon, Jan 4, 2021 at 10:25 PM David R Newman <drn@ecs.soton.ac.uk> wrote:

Hi John,

I have just tested your config change but it does not seem to work on the abstract page of the repository I have been testing on.  My recommendation would be to set the following config options at the end of 10_core.pl to make the URLs protocol-relative:

$c->{http_url} = '//' . $c->{host} . '/';

$c->{http_cgiurl} = '//' . $c->{host} . '/cgi/';

This sets a protocol-relative URL rather than an http one.  You could alternatively set it to 'https://' . $c->{host} . '/'; to just make all URLs https.  If you only set the protocol-relative option then there is a minor issue that the EPrint::View page for live items will display the URL as //HOSTNAME/12345/ rather than https://HOSTNAME/12345/, which may be confusing to some users as they would expect it to start with http or https.  Also, default abstract/summary pages will display the URI as protocol-relative at the end of the summary table.  These are issues I have been trying to address in adding robust protocol-relative URL support to EPrints 3.4.3.
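
For reference, the https-only alternative mentioned above might look like the following at the end of 10_core.pl. This is a sketch that assumes $c->{host} is already set earlier in the file; the http_cgiurl line is inferred by analogy with the protocol-relative example and should be tested before use.

```perl
# Sketch of the https-only alternative to the protocol-relative settings.
# Assumes $c->{host} has already been set earlier in 10_core.pl.
$c->{http_url} = 'https://' . $c->{host} . '/';

# Inferred by analogy with the protocol-relative example; test before relying on it.
$c->{http_cgiurl} = 'https://' . $c->{host} . '/cgi/';
```

With these settings the View page and summary table would show full https:// URLs rather than //HOSTNAME/... ones, at the cost of changing the URIs of existing eprint items.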

The motivation to switch to protocol-relative URLs is that it avoids a wholesale switch from http to https URLs with redirects, which I have noted often causes a dip in Google indexing and download stats for up to a month or so.  An explanation of why this happens can be found at:

https://wiki.eprints.org/w/Simplified_HTTPS_Configuration#Issues_and_Troubleshooting 

I don't think using https in the http_url configuration option will affect the download stats that much, as it won't lead to an http to https redirect, which is the predominant factor in lowering download stats.  It will, however, change the URIs for eprint items, which may have an effect on Google indexing.  However, this is very much dependent on how Google seeks out the URLs to be indexed, which is multivarious.

Regards

David Newman

On 04/01/2021 18:02, John Salter wrote:

I've just re-checked my config files.

For 3.3.x, if you include (in e.g. 10_core.pl):

    $c->{http_root} = undef;

It will make thumbnails/download links relative rather than absolute.

I think there was more to it than that though: if you're creating downloadable content (e.g. coversheets), you want it to render the full links (using https by default).

 

Cheers,

John

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of David R Newman via Eprints-tech
Sent: 04 January 2021 17:44
To: John Salter <J.Salter@leeds.ac.uk>; eprints-tech@ecs.soton.ac.uk; James Kerwin <jkerwin2101@gmail.com>
Subject: Re: [EP-tech] Word Documents won't download

 

Hi all,

So, the problem is the URL generated by cite:linkhere in EPrints' compiled XML, which uses http rather than https.  The suggestion John makes about https://wiki.eprints.org/w/Simplified_HTTPS_Configuration#HTTPS_Only will only work if you are running EPrints 3.4.1+, which I can see that you are not.

One of the features I have been working on for 3.4.3 is protocol-relative URLs, which should help deal with these issues.  If you are still running 3.3.x, fixing these sorts of problems will be tricky.  I think you need to look at the various document citations and possibly eprint_render.pl and replace the http URLs with https URLs.  In some cases the http URL will come from the <cite:linkhere>, which you will probably need to hack with a fix like:

<a href=""

and </cite:linkhere> with </a>

Hope this helps

David Newman

On 04/01/2021 17:20, John Salter wrote:

> I was just about to chime in that the document URL is rendered with http - but you're redirecting to https - so some part of Chrome's 302 handling is possibly confusing things…

 

To flesh that out a bit more:

White Rose is currently still available over http and https. Document links are relative - so match the protocol you're visiting the site from.

 

For LivRepo, it looks like you're using an HSTS setup so requests to http:// are redirected to https:// (via a 307 response).

 

If you update the download URL to use https (via Chrome Console / Inspect), it downloads fine.

 

To fix this in EPrints, https://wiki.eprints.org/w/Simplified_HTTPS_Configuration#HTTPS_Only - setting 'host' to undef is the key - although test this thoroughly first - can't remember if there's any related 'fun' with Symplectic connector if you do this (I don't think there is…).

 

Cheers,

John

 

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of David R Newman via Eprints-tech
Sent: 04 January 2021 17:02
To: James Kerwin <jkerwin2101@gmail.com>
Cc: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Word Documents won't download

 

Hi James,

You should note that the White Rose URL is http rather than https.  I have tested (on Chrome) the same URL I was testing before but with http rather than https and this worked just fine.  This is starting to suggest to me some security feature (albeit maybe a bit broken) within Chrome.  From the depths of my brain I vaguely recall some issue to do with content length mismatches that exhibited similar symptoms.

Regards

David Newman

On 04/01/2021 16:53, James Kerwin wrote:

Hi David,

 

Thanks for your response. It's good to know it's not just me (although I did ask my family to also attempt to download it and they all struggled).

 

To add to the confusion, this item on the White Rose repository downloads fine. Unless Mr Salter has some different setup, I'm afraid it only adds to my quiet terror:

http://eprints.whiterose.ac.uk/160018/

 

What I have noticed is that his filenames in the word documents have no spaces. I'm currently mooching through our EPrints database for a doc(x) file that also avoids spaces. This isn't the most scientific way to work it out, but I'm hoping it yields some results...

 

This sort of problem landing at my feet on the first day back at work should be considered some sort of abuse of my human rights!

 

If by some miracle I find a cause or solution I will share it.

 

Thanks,

James

 

On Mon, Jan 4, 2021 at 4:45 PM David R Newman <drn@ecs.soton.ac.uk> wrote:

Hi James,

I see the same behaviour on your repository for the Word document on the URL you provided.  Similarly, it works fine on Firefox but has problems on Chrome when you click on the link and don't try to download it in another tab.  Oddly, if I try a second time I get a popup asking me if I want to allow downloads of multiple files.  I have tested on a different repository and I see the same issue with both .doc and .docx files.  I suspect there may be issues with all application/... mime type files.  My best guess is this is a new security feature from Chrome.  It may be something that requires tweaking Apache's configuration or possibly even something within EPrints.

I have also tested on Edge and Opera (all browsers running on Windows 10) and I do not have any issues there either.  The Chrome version I am running is 87.0.4280.88, which looks to have been released at the beginning of December.   I do not know when my browser upgraded but there are currently no new updates available for Chrome according to my browser's "About Google Chrome" page.  I will continue to investigate and get back to the list if I find out anything more.

Regards

David Newman

On 04/01/2021 15:54, James Kerwin via Eprints-tech wrote:

Hi All,

 

Happy new year etc. Hope everyone is well.

 

I have a problem that has appeared today and it was fine before 18/12/2020 (as in it worked as expected).

 

Word documents are not downloading on the repository when using Chrome. If I right-click the download link and open it in a new tab it works and the file is downloaded. PDFs are behaving fine. If I use Firefox and click the download link the file will download, but it does prompt me whether I want to save or open (this is fine, I don't use FF much so I won't click the "don't ask again" option).

 

In Elements on both Chrome and FF the file will not download. PDFs are downloading through Elements fine.

 

Is there a likely cause for this? Some sort of update to some obscure working of the internet?

 

Example record:

 

 

Any help will be gratefully received. I'm totally confused by it and don't know where to start.

 

Thanks,

James



*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/

 
