EPrints Technical Mailing List Archive

Message: #05894



Re: [EP-tech] Seeing unusually high downloads in IRStats


I have not heard anything back from Bing Webmaster Support on this, but I can report our internal progress.

I am confident that, for us, between July 11 and July 22 Bing was generating accesses that were counted inappropriately, from:
	202.89.235.0/24
	111.221.28.0/24
	103.36.96.0/24
with the following User Agent strings:
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0 
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36 
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0 
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36 
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0 

I would be curious to know whether others have flagged additional IPs or User Agents that I have missed.

I would also be curious to know how others have filtered this (or are planning to filter this), since there appear to be legitimate accesses with these user agent strings as well.
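
One way to separate the two cases would be to require both conditions at once: the request comes from one of the three /24 ranges listed above and carries one of the listed User Agent strings. A minimal Python sketch, assuming that rule suits your logs; the function name is_suspect_access is hypothetical, and this is not how IRStats2 itself filters:

```python
import ipaddress

# The three ranges reported in this message.
SUSPECT_NETWORKS = [
    ipaddress.ip_network("202.89.235.0/24"),
    ipaddress.ip_network("111.221.28.0/24"),
    ipaddress.ip_network("103.36.96.0/24"),
]

# The six User Agent strings reported in this message.
_WEBKIT = "AppleWebKit/537.36 (KHTML, like Gecko)"
SUSPECT_USER_AGENTS = {
    f"Mozilla/5.0 (Windows NT 10.0; WOW64) {_WEBKIT} Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    f"Mozilla/5.0 (Windows NT 5.1) {_WEBKIT} Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    f"Mozilla/5.0 (Windows NT 5.1) {_WEBKIT} Chrome/45.0.2454.101 Safari/537.36",
    f"Mozilla/5.0 (Windows NT 6.1) {_WEBKIT} Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
    f"Mozilla/5.0 (Windows NT 6.1; WOW64) {_WEBKIT} Chrome/34.0.1847.131 Safari/537.36",
    f"Mozilla/5.0 (Windows NT 6.1; WOW64) {_WEBKIT} Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0",
}


def is_suspect_access(ip: str, user_agent: str) -> bool:
    """True only when the address is in a listed range AND the UA is a listed one."""
    addr = ipaddress.ip_address(ip)
    in_range = any(addr in net for net in SUSPECT_NETWORKS)
    return in_range and user_agent.strip() in SUSPECT_USER_AGENTS
```

With this rule, the same User Agent arriving from an address outside the three ranges is left alone, which is the point of combining the two tests.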

Finally, if anyone has a cheat sheet on how to clear the irstats2 tables for just the month of July so that process_stats can regenerate the data, I would appreciate it.  Thanks!

Enjoy,

- Clinton Graham
Systems Developer
University of Pittsburgh | University Library System
412-383-1057

-----Original Message-----
From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Graham, Clinton T
Sent: Tuesday, July 26, 2016 10:23 AM
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Seeing unusually high downloads in IRStats

What do you propose the User Agent match should be?  We found each of the following coming from Bing, among others:
Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36
Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0

We asked Bing Support either to describe any existing pattern for identification, or to comply with RFC 2616 section 14.22's use of the From header, so that we could recommend to Project COUNTER that this be considered for bot identification.
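
For context, RFC 2616 section 14.22 says that robot agents should send a From header carrying the e-mail address of the person responsible for them. A rough Python sketch of how a log processor could use that header, assuming a crawler actually sends it; the function name declared_operator and the sample address in the usage note are hypothetical:

```python
from email.utils import parseaddr
from typing import Mapping, Optional


def declared_operator(headers: Mapping[str, str]) -> Optional[str]:
    """Return the mailbox given in a From request header, or None.

    Header names are compared case-insensitively, as HTTP requires.
    A value without an "@" is treated as malformed and ignored.
    """
    for name, value in headers.items():
        if name.lower() == "from":
            _, addr = parseaddr(value)
            return addr if "@" in addr else None
    return None
```

A request with headers such as {"From": "crawler-admin@example.org"} would be reported as operated by that mailbox, while an ordinary browser request with no From header returns None and is counted as usual.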

Enjoy,

- Clinton Graham
Systems Developer
University of Pittsburgh | University Library System
412-383-1057

-----Original Message-----
From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Yuri
Sent: Tuesday, July 26, 2016 9:21 AM
To: eprints-tech@ecs.soton.ac.uk
Subject: Re: [EP-tech] Seeing unusually high downloads in IRStats

With Apache:

RewriteEngine On
RewriteCond %{HTTP:User-Agent} (?:Yandex|msnbot|Owlinbo|sistrix|genieo|proximic|MJ12bot|AhrefsBot|searchmetrics|SearchmetricsBot|Baidu) [NC]
RewriteRule .? - [F]

Just add the guilty ones.

Problem solved :-D
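
Before adding names to the pattern, it may be worth checking which User Agents it would actually catch. A small Python sketch using the same alternation, matched case-insensitively as [NC] does; the function name would_block is hypothetical, and Python's re module is only standing in for Apache's regex engine here:

```python
import re

# Same alternation as the RewriteCond above; re.IGNORECASE mirrors [NC].
BLOCKED_AGENTS = re.compile(
    r"(?:Yandex|msnbot|Owlinbo|sistrix|genieo|proximic|MJ12bot|AhrefsBot|"
    r"searchmetrics|SearchmetricsBot|Baidu)",
    re.IGNORECASE,
)


def would_block(user_agent: str) -> bool:
    """True if the pattern matches anywhere in the User Agent string."""
    return BLOCKED_AGENTS.search(user_agent) is not None
```

Feeding it the Chrome-style strings reported elsewhere in this thread returns False for each of them, since none contains any of the listed names, so a User Agent rule of this kind would need a different pattern to catch that traffic.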

On 26/07/2016 14:13, Graham, Clinton T wrote:
>
> The University of Pittsburgh opened ticket UCM000000270852 with Bing 
> Webmaster Support last week regarding this and received the following 
> response:
>
> Thank you for contacting Bing Webmaster Support.  The activity you are 
> seeing is most likely caused by one of our bots used for verifying 
> your site rather than indexing your site as Bingbot does.  These 
> crawlers do not have the same UA, and are in place to make sure the 
> verification aspects of your site are in place.
>
> Yesterday, we requested additional information on what "verification"
> really means, and described the problem of conflating user-generated
> activity with bot-generated activity, especially for the scholarly
> publication process.
>
> I'll reply again here if this support request goes anywhere, but 
> perhaps others might be interested in similarly engaging Bing 
> Webmaster Support?
>
> Enjoy,
>
> - Clinton Graham
>
> Systems Developer
>
> University of Pittsburgh | University Library System
>
> 412-383-1057
>
> *From:*eprints-tech-bounces@ecs.soton.ac.uk 
> [mailto:eprints-tech-bounces@ecs.soton.ac.uk] *On Behalf Of *Coles, 
> Elizabeth A. (Betsy)
> *Sent:* Monday, July 25, 2016 7:45 PM
> *To:* eprints-tech@ecs.soton.ac.uk
> *Subject:* [EP-tech] Seeing unusually high downloads in IRStats
>
> Forwarding from JISC-REPOSITORIES list - we've been seeing this in 
> California too, and our IRStats2 counts are through the roof for the 
> last couple of weeks.
>
> Can anyone tell me how to filter out these robots in IRStats2?  And 
> how to clean the access file so that our irstats2 reports are not 
> distorted by this deluge?  I assume I'd want to delete all entries 
> with a requester_id in the table below and rerun IRstats2 setup from 
> scratch.
>
> Thanks,
>
> Betsy Coles
>
> Caltech - Digital Library Development
>
> bcoles@caltech.edu <mailto:bcoles@caltech.edu>
>
> *From:* Repositories discussion list 
> [mailto:JISC-REPOSITORIES@JISCMAIL.AC.UK] *On Behalf Of *Hilary Jones
> *Sent:* Friday, July 15, 2016 3:43 AM
> *To:* JISC-REPOSITORIES@JISCMAIL.AC.UK 
> <mailto:JISC-REPOSITORIES@JISCMAIL.AC.UK>
> *Subject:* Seeing unusually high downloads in IRStats - IRUS-UK's 
> explanation and why this isn't affecting IRUS-UK stats
>
> Hi everyone,
>
> There was a discussion, via UKCORR mailing list, on why there are 
> exceptionally high downloads being seen this week in IRStats and what 
> might be causing it.
>
> After some investigation we have found that the unusually high 
> downloads are down to four IP ranges:
>
> IP range       Organisation            Location   No. IP addresses
> 103.25.156.*   Microsoft Bingbot       China      128
> 103.36.96.*    Microsoft Corporation   China      216
> 111.221.28.*   Microsoft Bingbot       China      256
> 202.89.235.*   Microsoft Bingbot       China      80
>
> These IPs have been systematically trawling and downloading files from 
> many UK repositories. Looking at their User Agent strings they do not 
> declare themselves as bots but masquerade as normal users.
>
> Happily, the IRUS-UK ingest has been filtering out these robotic 
> downloads, so you won't see a massive spike in your IRUS-UK stats.
>
> We hope this is of help.
>
> Best wishes
>
> Hilary
>
> Jisc <http://www.jisc.ac.uk/>
>
> *Hilary Jones*
> Services and Projects Support
>
> 0161 413 7541
> Skype hilary.jones@jisc.ac.uk <mailto:hilary.jones@jisc.ac.uk>
> Twitter @JonesHilaryJ
> 6th Floor Churchgate House, 56 Oxford Street, Manchester, M1  6EU
>
> *jisc.ac.uk <http://www.jisc.ac.uk/>*
>
> Jisc is a registered charity (number 1149740) and a company limited by 
> guarantee which is registered in England under Company No. 5747339, 
> VAT No. GB 882 5529 90. Jisc's registered office is: One Castlepark, 
> Tower Hill, Bristol, BS2 0JA. T 0203 697 5800. jisc.ac.uk <http://www.jisc.ac.uk/>
>
>
>
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
> *** EPrints developers Forum: http://forum.eprints.org/
