EPrints Technical Mailing List Archive

Message: #09027



Re: [EP-tech] OAI Harvester broken by new security


CAUTION: This e-mail originated outside the University of Southampton.

Just another quick thought: most harvesters present a user-agent string of either:

- something useful e.g. 'IRUS_metadata_harvesting_bot' or 'Unpaywall (http://unpaywall.org/; mailto:team@impactstory.org)'

- something software-y e.g. 'Apache-HttpClient/4.5.1 (Java/11.0.15)', 'pyoai' or 'GuzzleHttp/6.5.5 curl/7.58.0 PHP/7.4.29'

 

These could also be triggering a WAF (or similar mechanism) to say 'no'.

 

As the requests are currently being blocked, they probably aren't reaching your Apache logs, but you could check older logs with something like this (assuming you're using the combined log format, which records the user-agent) to get a list of the user-agents hitting the OAI endpoint and how many requests each has made:
you@server> grep 'oai2' /path/to/the/apache/access.log | cut -d\" -f6 | sort | uniq -c | sort -n

 

Using a double-quote as the delimiter feels a bit hacky, but in this case I think it is easier than splitting on whitespace or another character!
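
If splitting on double quotes feels too fragile, the same tally can be sketched in Python with a regular expression. This is only a minimal sketch: it assumes the Apache combined log format (user-agent as the last quoted field), and the function name count_user_agents and the sample log lines are made up for illustration. Unlike the grep, it only looks for 'oai2' in the request itself, not in the referer.

```python
import re
from collections import Counter

# Combined log format ends with: "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$'
)


def count_user_agents(lines, needle="oai2"):
    """Tally user-agents for requests whose request line contains `needle`."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and needle in match.group("request"):
            counts[match.group("agent").strip()] += 1
    return counts


# Hypothetical log lines, for illustration only.
sample = [
    '192.0.2.1 - - [09/Aug/2022:10:00:00 +0100] "GET /cgi/oai2?verb=Identify HTTP/1.1" 200 512 "-" "pyoai"',
    '192.0.2.1 - - [09/Aug/2022:10:00:05 +0100] "GET /cgi/oai2?verb=ListRecords&metadataPrefix=oai_dc HTTP/1.1" 200 9000 "-" "pyoai"',
    '192.0.2.7 - - [09/Aug/2022:10:01:00 +0100] "GET /cgi/oai2?verb=ListSets HTTP/1.1" 200 700 "-" "Unpaywall (http://unpaywall.org/; mailto:team@impactstory.org)"',
    '192.0.2.9 - - [09/Aug/2022:10:02:00 +0100] "GET /view/year/ HTTP/1.1" 200 300 "-" "Mozilla/5.0"',
]

# Print least-frequent first, like `sort | uniq -c | sort -n`.
for agent, hits in sorted(count_user_agents(sample).items(), key=lambda kv: kv[1]):
    print(hits, agent)
```

To run it against a real log, pass an open file instead of the sample list, e.g. count_user_agents(open('/path/to/the/apache/access.log')).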

 

Cheers,

John

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of John Salter via Eprints-tech
Sent: 09 August 2022 10:11
To: eprints-tech@ecs.soton.ac.uk; James Kerwin <jkerwin2101@gmail.com>
Subject: Re: [EP-tech] OAI Harvester broken by new security

 


Hi James,
I'm guessing the 'security changes' include a WAF (web application firewall) or similar?

 

The OAI-PMH resumptionToken isn't that complicated - essentially parameters that can be passed to the script directly are URL-encoded.

I can see how this might trigger some WAF rules.
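
To illustrate why a token like this can look suspicious: once several parameters are packed into a single value and URL-encoded, the request carries runs of percent-escapes such as %3D and %26. The sketch below uses Python's standard library; the token layout, the 'offset' parameter and the example.org host are hypothetical and are not EPrints' actual token format.

```python
from urllib.parse import parse_qs, quote, unquote

# Hypothetical state a harvest might need to carry between requests.
inner = "metadataPrefix=oai_dc&offset=100"

# Encode the whole parameter string so it can travel as one argument value.
token = quote(inner, safe="")
url = (
    "https://repository.example.org/cgi/oai2"
    "?verb=ListRecords&resumptionToken=" + token
)
print(url)

# The receiving script can recover the original parameters by decoding.
decoded = parse_qs(unquote(token))
print(decoded)
```

The point is only that '=' and '&' inside a parameter value end up escaped, which is a pattern some generic WAF rules treat as an injection attempt.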

 

I think the main approaches are:-

- whitelist the OAI-PMH endpoint in the WAF

- whitelist harvesters in the WAF (you might not know all harvesters that visit your repo though!)

- create a ruleset for the OAI-PMH vocabulary to be included in the WAF
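
As a rough idea of what the third option would need to encode, here is a sketch of the OAI-PMH 2.0 vocabulary expressed as a Python check. A real ruleset would be written in the WAF's own rule language; the function name looks_like_oai_pmh is made up, and this only checks verb and argument names, not their values.

```python
from urllib.parse import parse_qs

# The six verbs and the argument names defined by OAI-PMH 2.0.
VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}
ARGUMENTS = {
    "verb",
    "identifier",
    "metadataPrefix",
    "from",
    "until",
    "set",
    "resumptionToken",
}


def looks_like_oai_pmh(query_string):
    """Return True if the query uses only OAI-PMH argument names and one known verb."""
    params = parse_qs(query_string, keep_blank_values=True)
    if set(params) - ARGUMENTS:
        return False  # unknown argument name
    verbs = params.get("verb", [])
    return len(verbs) == 1 and verbs[0] in VERBS


print(looks_like_oai_pmh("verb=ListRecords&metadataPrefix=oai_dc"))
print(looks_like_oai_pmh("verb=Identify&cmd=ls"))
```

Requests that pass a check like this could be exempted from the stricter generic rules, rather than exempting the whole endpoint.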

 

The nature of an OAI-PMH harvest could look very much like a bad-actor probing your server.

The nature of the response payload could also mean the harvest creates peaks in server usage, which could make automated tooling connect the OAI-PMH requests to a DoS-style attack.

 

Without knowing exactly what's at play it's difficult to make more refined suggestions.

Happy to have an off-list discussion about this, seeing as it's security-related.

 

Cheers,

John

 

From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of James Kerwin via Eprints-tech
Sent: 09 August 2022 09:57
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] OAI Harvester broken by new security

 


Hello all,

 

Hope everyone is doing well. 

 

This isn't a specific EPrints problem, but as you all use EPrints there may be some experience...

 

We've had some security changes at the uni recently. Some of these result in us clicking buttons in EPrints and then we get taken to our IT Services security page. So far we've handled this by accessing via the university network (e.g. VPN).

 

This issue has now hit our OAI harvester. Specifically, it happens under "ListRecords" when we click the "Resume" button (https://livrepository.liverpool.ac.uk/cgi/oai2?verb=ListRecords&metadataPrefix=oai_dc). Currently the organisations that usually harvest our content are unable to do so. I have spoken with our IT Services team to find a solution. Has anybody else experienced similar issues at their organisations, and are there any steps you think I can take to resolve it?

 

It doesn't help that I don't know how resumption tokens work. I assume they are stored in a database somewhere? Or a file? The other instances of this in the repository occur when making changes to file metadata, though not EPrint record metadata.

 

Thanks,

James