EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #10131



Re: [EP-tech] DDoS of EPrints advanced search


CAUTION: This e-mail originated outside the University of Southampton.

Hi Matthew,


I'm not sure separating search would really help; it just shifts the load to another system. Even with our Elasticsearch implementation (https://github.com/eprintsug/EPrintsElasticsearch) we see attackers spoofing URLs and crawling excessively, and on other implementations they also post random queries.
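One common way to blunt this kind of excessive crawling, whichever backend serves the search, is per-client throttling. Below is a minimal token-bucket sketch; the names (`RATE`, `BURST`, `allow_request`) and the limits are illustrative assumptions, not part of EPrints or the Elasticsearch plugin.

```python
# Hypothetical per-IP throttle for a search endpoint: each client gets a
# token bucket that refills slowly; a request spends one token.
import time
from collections import defaultdict

RATE = 0.5   # tokens added per second (~30 searches/minute, an assumption)
BURST = 10   # maximum burst size

_buckets = defaultdict(lambda: {"tokens": BURST, "ts": time.monotonic()})

def allow_request(client_ip: str) -> bool:
    """Return True if this client may run a search right now."""
    b = _buckets[client_ip]
    now = time.monotonic()
    # Refill tokens for the time elapsed since the last request, capped at BURST.
    b["tokens"] = min(BURST, b["tokens"] + (now - b["ts"]) * RATE)
    b["ts"] = now
    if b["tokens"] >= 1:
        b["tokens"] -= 1
        return True
    return False
```

In front of EPrints this logic would more typically live in the load balancer or web server, but the bucket-per-client idea is the same.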


Kind regards,


Martin


--

Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Pfingstweidstrasse 60B
CH-8005 Zürich


From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> on behalf of Matthew Kerwin <matthew.kerwin@qut.edu.au>
Date: Tuesday, 3 June 2025 at 06:27
To: eprints-tech@ecs.soton.ac.uk <eprints-tech@ecs.soton.ac.uk>, David R Newman <drn@ecs.soton.ac.uk>
Subject: RE: [EP-tech] DDoS of EPrints advanced search

CAUTION: This e-mail originated outside the University of Southampton.

We have actually been discussing, at a blue-sky level so nothing serious, whether it would be technically possible to separate the search functionality out of EPrints entirely. We already have a separate search engine built in, but if the indexer process were responsible for pulling data from the repo DB and putting it into a distinct search DB, with its own interfaces and whatnot, we could partition the two different types of load.

Just an idea. (I'd also like to do it with the access tables, since IRstats is *almost* doing that already, in the same vein)

FWIW our repo is behind an F5 load balancer, and for now I've set up a simple iRule that blocks all access to /cgi/search from external IP addresses, at least until we can work out what to do moving forward.
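The actual iRule is not shown in the thread, but its logic is simple enough to sketch. Below is a Python illustration of the same decision (deny /cgi/search to clients outside internal ranges); the network ranges and the function name are assumptions for illustration, not QUT's configuration.

```python
# Illustrative equivalent of "block /cgi/search from external IPs":
# allow the search path only when the client address falls inside one
# of the internal networks (RFC 1918 ranges, assumed here).
from ipaddress import ip_address, ip_network

INTERNAL_NETS = [ip_network(n) for n in
                 ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def search_allowed(client_ip: str, path: str) -> bool:
    """Allow /cgi/search only from internal addresses; other paths are open."""
    if not path.startswith("/cgi/search"):
        return True
    addr = ip_address(client_ip)
    return any(addr in net for net in INTERNAL_NETS)
```

In practice this check belongs in the F5 (or Apache/nginx in front of EPrints), not in application code, but the rule itself is just a path prefix plus an address-range test.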

Cheers
--
Matty Kerwin (he/him)
Software Engineer
Education & Research
Digital Business Solutions

Queensland University of Technology
Email: matthew.kerwin@qut.edu.au
KG-X232, Kelvin Grove Campus


-----Original Message-----
From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> On Behalf Of Florian Heß
Sent: Monday, 2 June 2025 19:23
To: David R Newman <drn@ecs.soton.ac.uk>; eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] DDoS of EPrints advanced search

CAUTION: This e-mail originated outside the University of Southampton.

Hi David,

we are running MySQL 8.0.42 :-)

As an alternative caching mechanism, I would like to suggest considering files. Files are cached by the I/O subsystem and can be cleaned up independently without, I think, interfering with EPrints' logic (read access to an open file stays safe even when the inode's link is removed a moment later).
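The unlink-safety property Florian relies on is standard POSIX behaviour and easy to demonstrate. A minimal sketch (file names and contents are made up for illustration):

```python
# A reader that has already opened a cache file keeps a valid handle even
# if an independent cleaner unlinks the file moments later, because the
# inode lives on until the last open handle is closed (POSIX semantics).
import os
import tempfile

cache_dir = tempfile.mkdtemp()
path = os.path.join(cache_dir, "search-cache-42.txt")

with open(path, "w") as f:
    f.write("eprintid 101\neprintid 205\n")

reader = open(path)   # reader opens the cached result set
os.unlink(path)       # a cleaner removes the cache file concurrently
data = reader.read()  # the open inode is still fully readable
reader.close()
```

This is what makes file-based caches safe to expire with a simple cron-style sweep, in contrast to the DROP TABLE vs. SELECT races described later in the thread.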


Regards
F Heß


Am 02.06.25 um 11:09 schrieb David R Newman:
> Hi Florian,
>
> Out of interest, what version of MySQL/MariaDB are you running? When
> we were running CentOS 7, which had MariaDB 5.5, we found an issue with
> particularly complex SQL queries (certain advanced searches) that
> would take a long time to run and there would be a query trying to
> drop the 'cache' table that would get queued whilst the original query
> that created the cache table was still running.  Since we have moved
> to Rocky Linux 9, which runs MariaDB 10.5, we have not had any
> significant problems like this. However, until this unhelpful bot
> behaviour started, it would have been very unusual for 100+ searches
> to be made in only a few minutes.  More normally it might take tens of
> minutes, if not hours for that number of searches.
>
> In an ideal world we would like to remove 'cache' tables for search in
> future versions of EPrints as modern MySQL/MariaDB can do this quite
> well natively, if suitably configured.  However, the way 'cache'
> tables are created is quite ingrained into EPrints; if they were removed
> we would need to ensure that what MySQL/MariaDB gets handed as a query
> sufficiently matches something it has cached natively.
>
> Regards
>
> David Newman
>
> On 02/06/2025 09:45, Florian Heß wrote:
>> CAUTION: This e-mail originated outside the University of Southampton.
>>
>> Hi John,
>>
>> in addition to that, we also experience apparently regular race
>> conditions between selecting from and dropping cache tables, which can
>> lock up database access; this has actually happened quite often. After
>> killing the long-running MySQL query (find its ID with `mysql> show
>> full processlist;`), all waiting requests are processed.
>>
>>
>> Kind regards
>> Florian
>>
>> Am 30.05.25 um 15:39 schrieb John Salter:
>>> I added a script to my server to log the number of search cache
>>> tables, and the min/max IDs of them for each hour.
>>> I plan to use this to redirect requests with 'old' cache ids in the
>>> query-string to a static page, which will describe (to a human) how
>>> to re-run their search, but not provide a clickable link to do so.
>>> If others are also seeing this pattern, I can share my stuff once
>>> it's ready.
>>
>>
>> *** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
>> *** Archive: https://www.eprints.org/tech.php/
>> *** EPrints community wiki: https://wiki.eprints.org/
>>