EPrints Technical Mailing List Archive

See the EPrints wiki for instructions on how to join this mailing list and related information.

Message: #10173



Re: [EP-tech] DDoS on simple and advanced search


Hi John, Tomasz and others,

I have been reviewing how to get mod_security to do what I want, which is to start returning 429 (Too Many Requests) for any service that is getting hammered by bots.  As we know, this is a DDoS, so the rules need to cover all IP addresses rather than individual ones.  This is not ideal, as it could block genuine requests, so it is important to get the deprecation of the counters right: only send a 429 when not doing so would take the repository offline (or at best make it very unresponsive), whilst not leaving services (e.g. search, export and exportview) perpetually unavailable.

I believe the following configuration (added to /etc/httpd/modsecurity.d/local_rules/modsecurity_localrules.conf on RHEL 8/9 or a similar Linux OS) strikes that middle ground, but depending on the resources your EPrints repository server has you may want to tweak some of these parameters:

<LocationMatch "^/cgi/search/(archive/)?(advanced|simple)">
  SecAction id:210001,initcol:ip=0.0.0.0,pass,nolog
  SecRule REQUEST_URI "^/cgi/" "id:210002,phase:2,nolog,setvar:ip.searchcntr=+1,deprecatevar:ip.searchcntr=20/12"
  SecRule IP:SEARCHCNTR "@gt 100" "phase:2,id:210003,deny,status:429,setenv:RATELIMITED,skip:1,log"
  Header always set Retry-After "12" env=RATELIMITED
</LocationMatch>

<LocationMatch "^/cgi/export/">
  SecAction id:210011,initcol:ip=0.0.0.0,pass,nolog
  SecRule REQUEST_URI "^/cgi/" "id:210012,phase:2,nolog,setvar:ip.expcntr=+1,deprecatevar:ip.expcntr=20/12"
  SecRule IP:EXPCNTR "@gt 100" "phase:2,id:210013,deny,status:429,setenv:RATELIMITED,skip:1,log"
  Header always set Retry-After "12" env=RATELIMITED
</LocationMatch>

<LocationMatch "^/cgi/exportview/">
  SecAction id:210021,initcol:ip=0.0.0.0,pass,nolog
  SecRule REQUEST_URI "^/cgi/" "id:210022,phase:2,nolog,setvar:ip.expvcntr=+1,deprecatevar:ip.expvcntr=20/12"
  SecRule IP:EXPVCNTR "@gt 100" "phase:2,id:210023,deny,status:429,setenv:RATELIMITED,skip:1,log"
  Header always set Retry-After "12" env=RATELIMITED
</LocationMatch>

Technically the LocationMatch is not needed, as the SecRule REQUEST_URI could do this job on its own.  However, for readability and debugging the LocationMatch keeps things tidier, if not necessarily normal practice.  I am not experienced with the potential nuances of SecRule regexps, so I kept this simple with just "^/cgi/".  You do need to make this a SecRule with something to match against, otherwise mod_security throws the following warning message:

ModSecurity: Warning. Unconditional match in SecAction.
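
For what it is worth, here is an untested sketch of how the first rule chain might look without the LocationMatch wrapper, matching the search paths directly in the SecRule instead (this would replace, not duplicate, the rules above, as the IDs must stay unique):

SecAction id:210001,initcol:ip=0.0.0.0,pass,nolog
SecRule REQUEST_URI "^/cgi/search/(archive/)?(advanced|simple)" "id:210002,phase:2,nolog,setvar:ip.searchcntr=+1,deprecatevar:ip.searchcntr=20/12"
SecRule IP:SEARCHCNTR "@gt 100" "phase:2,id:210003,deny,status:429,setenv:RATELIMITED,skip:1,log"
Header always set Retry-After "12" env=RATELIMITED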

As you can see, there are separate rule chains for search, export and exportview.  I have contemplated adding view as well, but in theory that should be cached.

The first line inside every LocationMatch is a bit of a hack.  You would normally do something like "initcol:ip=%{REMOTE_ADDR}", but as we want to match all IPs, due to the DDoS nature of the attacks, I have gone for a placeholder of '0.0.0.0'.
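
For comparison, the conventional per-client version of that first line would be something like the following, which is what most mod_security rate-limiting examples use, but which would not help here because each bot IP only makes a handful of requests:

SecAction id:210001,initcol:ip=%{REMOTE_ADDR},pass,nolog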

The second line is just a perfunctory match, but the remainder of the line is the interesting bit: setvar does the incrementing and deprecatevar does the deprecation of the counter over time, in this case by 20 every 12 seconds.
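
If I have the arithmetic right, a decay of 20 every 12 seconds works out at 100 per minute, so with the threshold of 100 below, the 429s should only start once one of these rule chains is seeing sustained traffic above roughly 100 requests a minute (across all IPs combined), and a maxed-out counter should drain back to zero about a minute after the hammering stops.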

The third line enforces the 429 when the counter is greater than 100.  I have also added the following line to my Apache configuration and created an EPrints lang/en/static/rate_limited.xpage to display a message saying this service is "getting too many requests right now".

ErrorDocument 429 /rate_limited.html
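
For anyone wanting to do the same, a minimal lang/en/static/rate_limited.xpage along these lines should do the job (the wording is just an example, and you will need to regenerate your static pages so that /rate_limited.html exists):

<?xml version="1.0" encoding="utf-8" standalone="no"?>
<xpage:page xmlns="http://www.w3.org/1999/xhtml" xmlns:xpage="http://eprints.org/ep3/xpage">
  <xpage:title>Too Many Requests</xpage:title>
  <xpage:body>
    <p>This service is getting too many requests right now. Please try again in a few seconds.</p>
  </xpage:body>
</xpage:page>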

The rest of the third line denies the request, sends a 429 (Too Many Requests) response and sets the RATELIMITED environment variable, which triggers the Retry-After header on the following line.  That header is just a useful hint to tell the client it is worth trying again 12 seconds later, after the next deprecation.  If you have mod_security logging switched on and at a suitable log level, you should be able to see which requests, and from which IPs, are getting 429-ed.  You could also grep the access logs for 429 responses (although that can require a careful regexp so as not to match responses of 429 bytes).
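
On the access log front, assuming the default combined log format (where the status code is the ninth field), something like this sidesteps the byte-count false matches; the log path here is the RHEL default and may differ on your system:

awk '$9 == 429' /var/log/httpd/access_log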

I have been running this on several repositories today and, although I have still been seeing high numbers of Apache processes / MySQL connections on occasion, I have not seen elevated CPU load or poor responsiveness on a monitoring check that I have set up to regularly request the homepage of these repositories and report back how long it takes to get a response.

It is only early days, so I expect some tweaking of the parameters will be needed, along with the inclusion of extra paths (e.g. /cgi/stats/... and possibly /views/...).  I will keep people posted on my results and on any updates I need to make to the mod_security configuration.

Regards

David Newman

On 21/07/2025 10:21, John Salter wrote:
CAUTION: This e-mail originated outside the University of Southampton.

Hi All,

An aspect of the way EPrints deals with search requests may be causing us some of these issues.

What is happening:

  • Something searches your site
  • It extracts all the links from the results page – including the paginated links
  • These links are saved, and at some point (weeks/months in the future) will be requested from a network of devices
  • Each link contains both the cacheid of the original search, and all the parameters needed to re-run the search

 

When the paginated links are farmed out to the network of devices (hence the spread of IP addresses), the original EPrints search cache has expired.

Each paginated link then triggers the same original search to be run – with each request making a new cache table.

 

If the original search expression returned 1000 results, presented with 20 links on each page, the follow-up crawl of those paginated links will create 50 new individual cache tables.

 

I’ve documented it here: https://github.com/eprints/eprints3.4/issues/479 .

 

A quick short-term fix would be to stop EPrints auto-re-running a search if the search contained an old cache id.

This changes the current user experience, but I think would be better than systems becoming unresponsive.

 

NB the above has been observed using the ‘internal’ search methods (rather than Xapian/ElasticSearch).

 

Cheers,

John

 

From: eprints-tech-request@ecs.soton.ac.uk <eprints-tech-request@ecs.soton.ac.uk> On Behalf Of Yuri Carrer
Sent: 18 July 2025 07:24
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] DDoS on simple and advanced search

 


CAUTION: External Message. Use caution opening links and attachments.


CAUTION: This e-mail originated outside the University of Southampton.


Botnets don't use the same IPs.

 

The solution is simpler: rename the search script and update the internal links (or let EPrints use some config setting for it). You can do it weekly; nobody will notice, but bots won't be able to keep up with it.

 

On 17/07/25 19:32, Tomasz Neugebauer wrote:

Any comments on that solution?  It seems elegant, if it works?

 

Tomasz

 


 
-- 
Yuri Carrer
 
 CAB - Centro di Ateneo per le Biblioteche, Università di Padova
 Tel: 049/827 9712 - Via Beato Pellegrino, 28 - Padova

*** Options: https://wiki.eprints.org/w/Eprints-tech_Mailing_List
*** Archive: https://www.eprints.org/tech.php/
*** EPrints community wiki: https://wiki.eprints.org/