EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #10132



Re: [EP-tech] DDoS of EPrints advanced search


CAUTION: This e-mail originated outside the University of Southampton.
Hi everyone,

I've seen some of the same repeat requests on our advanced search that others have reported on this thread, although at a more moderate volume. Our IT security team made some tweaks to the firewall that have dramatically slowed the rate at which these repeat requests get through. Those that do get through receive a 403 right away from Apache, as I added a regex to the Apache config along the lines of what David suggested, based on what I saw in the logs. For the time being, that has been sufficient. My intuition is that this sort of issue (repeat requests / DDoS attacks) should be dealt with at the firewall level. In terms of pages that are vulnerable, although not targeted, I worry about IRStats2. I am considering putting it behind a login, but at the same time it's great to have that data openly available, so I hesitate.

As I was monitoring the situation, it struck me just how much Gen AI crawling is happening on our repository, from OpenAI, Bing/Copilot, etc. The IR is a central/important data infrastructure for these services, which are themselves in the midst of litigation regarding copyright and fair use.

Tomasz


________________________________________________

Tomasz Neugebauer
Senior Librarian | Bibliothécaire titulaire
Digital Projects & Systems Development Librarian / Bibliothécaire des Projets Numériques & Développement de Systèmes
Concordia University / Université Concordia

Tel. / Tél. 514-848-2424 ext. / poste 7738
Email / courriel:
tomasz.neugebauer@concordia.ca

Mailing address / adresse postale: 1455 De Maisonneuve Blvd. W., LB-540-03, Montreal, Quebec H3G 1M8
Street address / adresse municipale: 1400 De Maisonneuve Blvd. W., LB-540-03, Montreal, Quebec H3G 1M8

library.concordia.ca


From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Martin Brändle <martin.braendle@uzh.ch>
Sent: Thursday, June 5, 2025 9:52 AM
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>; David R Newman <drn@ecs.soton.ac.uk>
Subject: Re: [EP-tech] DDoS of EPrints advanced search
 

Attention This email originates from outside the concordia.ca domain. // Ce courriel provient de l'extérieur du domaine de concordia.ca





Hi Matthew,

 

not sure if separating search would really help. One just shifts the load to another system. Even with our Elasticsearch implementation (https://github.com/eprintsug/EPrintsElasticsearch) we see hackers trying to spoof the URL and crawl excessively; on other implementations they also try to post random queries.

 

Kind regards,

 

Martin

 

--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich

 

 

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Matthew Kerwin <matthew.kerwin@qut.edu.au>
Date: Tuesday, 3 June 2025 at 06:27
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>, David R Newman <drn@ecs.soton.ac.uk>
Subject: RE: [EP-tech] DDoS of EPrints advanced search


We have actually been discussing, at a blue-sky level so nothing serious, if it would be technically possible to entirely separate the search functionality out of eprints. We already have a separate search engine built in, but if the indexer process was responsible for pulling data from the repo DB and putting it into a distinct search DB, with its own interfaces and whatnot, we could partition the two different types of load.

Just an idea. (I'd also like to do it with the access tables, since IRstats is *almost* doing that already, in the same vein)

FWIW our repo is behind an F5 load balancer, and for now I've set up a simple iRule that blocks all access to /cgi/search from external IP addresses, at least until we can work out what to do moving forward.
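
The rule described above (deny /cgi/search to external addresses) can be illustrated outside the load balancer as a small decision function. This is only a sketch in Python, not the actual iRule: the internal network ranges and the function names are assumptions made up for the example.

```python
import ipaddress

# Hypothetical internal ranges; replace with the networks your site really uses.
INTERNAL_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.0.0/16"),
]

# Path prefix taken from the message above.
BLOCKED_PREFIX = "/cgi/search"


def is_internal(client_ip: str) -> bool:
    """Return True if client_ip falls inside one of the internal networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in INTERNAL_NETWORKS)


def should_block(path: str, client_ip: str) -> bool:
    """Return True when an external client requests a search URL."""
    return path.startswith(BLOCKED_PREFIX) and not is_internal(client_ip)
```

With these assumed ranges, `should_block("/cgi/search/archive/advanced", "203.0.113.5")` is true, while the same path requested from `10.1.2.3` is allowed.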

Cheers
--
Matty Kerwin (he/him)
Software Engineer
Education & Research
Digital Business Solutions

Queensland University of Technology
Email: matthew.kerwin@qut.edu.au
KG-X232, Kelvin Grove Campus


-----Original Message-----
From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> On Behalf Of Florian Heß
Sent: Monday, 2 June 2025 19:23
To: David R Newman <drn@ecs.soton.ac.uk>; eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] DDoS of EPrints advanced search


Hi David,

we are running MySQL 8.0.42 :-)

As an alternative caching mechanism I would like to suggest considering files. Files are cached by the IO subsystem and can be cleaned up independently, and I think without interfering with EPrints logic (read access to a file stays safe when the inode link is removed a moment later).
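
The point about read access staying safe after the link is removed can be demonstrated with a short Python sketch. It assumes a POSIX system (on Windows, unlinking an open file typically fails), and the function name and file prefix are invented for the illustration; it is not EPrints code.

```python
import os
import tempfile


def read_after_unlink(payload: bytes) -> bytes:
    """Write payload to a temp file, open it, unlink it, then read it back."""
    fd, path = tempfile.mkstemp(prefix="search-cache-")
    with os.fdopen(fd, "wb") as writer:
        writer.write(payload)

    reader = open(path, "rb")
    try:
        # A cleanup job removes the directory entry while a reader is active.
        os.unlink(path)
        # The open handle still refers to the inode, so the data is readable.
        return reader.read()
    finally:
        reader.close()
```

The file's storage is only released once the last open handle is closed, which is why a cleaner can run independently of readers.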


Regards
F Heß


Am 02.06.25 um 11:09 schrieb David R Newman:
> Hi Florian,
>
> Out of interest, what version of MySQL/MariaDB are you running? When
> we were running CentOS 7 that had MariaDB 5.5 we found the issue with
> particularly complex SQL queries (certain advanced searches) that
> would take a long time to run and there would be a query trying to
> drop the 'cache' table that would get queued whilst the original query
> that created the cache table was still running.  Since we have moved
> to Rocky Linux 9 that runs MariaDB 10.5, we have not had any
> significant problems like this. However, until this unhelpful bot
> behaviour started, it would have been very unusual for 100+ searches
> to be made in only a few minutes.  More normally it might take tens of
> minutes, if not hours for that number of searches.
>
> In an ideal world we would like to remove 'cache' tables for search in
> future versions of EPrints as modern MySQL/MariaDB can do this quite
> well natively, if suitably configured.  However, the way 'cache'
> tables are created is quite ingrained into EPrints; if they were removed
> we would need to ensure that what MySQL/MariaDB gets handed as a query
> sufficiently matches something it has cached natively.
>
> Regards
>
> David Newman
>
> On 02/06/2025 09:45, Florian Heß wrote:
>>
>> Hi John,
>>
>> in addition to that we also regularly experience what appear to be race
>> conditions between selecting from and dropping cache tables. These can
>> lock database access, which has actually happened quite often. After
>> killing the mysql process that has been running for a long time
>> (`mysql> show full processlist;`), all waiting requests are processed.
>>
>>
>> Kind regards
>> Florian
>>
>> Am 30.05.25 um 15:39 schrieb John Salter:
>>> I added a script to my server to log the number of search cache
>>> tables, and the min/max IDs of them for each hour.
>>> I plan to use this to redirect requests with 'old' cache ids in the
>>> query-string to a static page, which will describe (to a human) how
>>> to re-run their search, but not provide a clickable link to do so.
>>> If others are also seeing this pattern, I can share my stuff once
>>> it's ready.
>>
>>
>> *** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
>> *** Archive: https://www.eprints.org/tech.php/
>> *** EPrints community wiki: https://wiki.eprints.org/
>>
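
John Salter's plan quoted above (redirecting requests that carry 'old' cache ids to a static page) could be sketched roughly as follows. His actual script is not shown in the thread, so everything here is an assumption: the query parameter name `cache`, the function name, and the idea that the threshold comes from the logged minimum id of the cache tables that still exist.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed name of the query-string parameter that carries the cache id.
CACHE_PARAM = "cache"


def is_stale_search(url: str, min_live_cache_id: int) -> bool:
    """Return True if the URL refers to a cache id older than the oldest live one."""
    query = parse_qs(urlsplit(url).query)
    values = query.get(CACHE_PARAM)
    if not values:
        # No cache id: a fresh search, so let it through.
        return False
    try:
        cache_id = int(values[0])
    except ValueError:
        # Non-numeric value: leave the decision to other rules.
        return False
    return cache_id < min_live_cache_id
```

A front-end handler could call this for each search request and redirect to the static explanation page whenever it returns true.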