
[EP-tech] Seeing unusually high downloads in IRStats



Hi All,

If I had to guess, they are 'verifying' that what the bots retrieve is what the general public retrieves.
There have been many cases of over-eager SEO 'experts' dishing up one set of content to search bots and then serving up something else to non-bot requests (e.g. spam or malicious content).

The only way to do that is to come in a second time, and have nothing 'botlike' about the request, and compare the results.
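
As a rough illustration of that second-pass comparison, here is a minimal sketch. It assumes the two response bodies (one fetched with a bot User-Agent, one fetched as an ordinary browser) are already in hand; the function names are hypothetical, and nothing here reflects how Bing actually performs its check.

```python
import hashlib
import re


def body_digest(body: bytes) -> str:
    """Hash a response body after collapsing runs of whitespace."""
    normalised = re.sub(rb"\s+", b" ", body).strip()
    return hashlib.sha256(normalised).hexdigest()


def responses_differ(bot_body: bytes, browser_body: bytes) -> bool:
    """Return True when the bot-facing and browser-facing bodies do not match."""
    return body_digest(bot_body) != body_digest(browser_body)
```

A real check would have to tolerate legitimately dynamic content (timestamps, session tokens), so an exact digest match is only the simplest possible criterion.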

Cheers

Matt.

From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Graham, Clinton T
Sent: Tuesday, 26 July 2016 10:14 PM
To: eprints-tech at ecs.soton.ac.uk
Subject: Re: [EP-tech] Seeing unusually high downloads in IRStats

The University of Pittsburgh opened ticket UCM000000270852 with Bing Webmaster Support last week regarding this and received the following response:
Thank you for contacting Bing Webmaster Support.  The activity you are seeing is most likely caused by one of our bots used for verifying your site rather than indexing your site as Bingbot does.  These crawlers do not have the same UA, and are in place to make sure the verification aspects of your site are in place.

Yesterday, we requested additional information on what "verification" really means, and described the problem of conflating user-generated activity with bot-generated activity, especially for the scholarly publication process.

I'll reply again here if this support request goes anywhere, but perhaps others might be interested in similarly engaging Bing Webmaster Support?

Enjoy,

- Clinton Graham
Systems Developer
University of Pittsburgh | University Library System
412-383-1057

From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Coles, Elizabeth A. (Betsy)
Sent: Monday, July 25, 2016 7:45 PM
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Seeing unusually high downloads in IRStats

Forwarding from JISC-REPOSITORIES list - we've been seeing this in California too, and our IRStats2 counts are through the roof for the last couple of weeks.

Can anyone tell me how to filter out these robots in IRStats2?  And how to clean the access table so that our IRStats2 reports are not distorted by this deluge?  I assume I'd want to delete all entries whose requester_id falls within the IP ranges in the table below and rerun the IRStats2 setup from scratch.
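
Before deleting anything, it may help to count how many rows would be affected. The sketch below only builds a parameterised query and runs nothing. It assumes the table is named `access`, that `requester_id` holds the plain IP address as a string, and that the database driver uses `%s` placeholders; all three should be checked against the local installation first. The prefixes are the four ranges reported in the forwarded message.

```python
# Prefixes taken from the four IP ranges listed in the forwarded message.
PREFIXES = ["103.25.156.", "103.36.96.", "111.221.28.", "202.89.235."]


def build_count_query(prefixes, table="access", column="requester_id"):
    """Build a dry-run COUNT query; table and column names are assumptions."""
    where = " OR ".join(f"{column} LIKE %s" for _ in prefixes)
    sql = f"SELECT COUNT(*) FROM {table} WHERE {where}"
    params = [prefix + "%" for prefix in prefixes]
    return sql, params


sql, params = build_count_query(PREFIXES)
print(sql)
print(params)
```

If the count looks plausible, the same WHERE clause could be reused for the clean-up itself, ideally after taking a backup of the table.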

Thanks,
Betsy Coles
Caltech - Digital Library Development
bcoles at caltech.edu

From: Repositories discussion list [mailto:JISC-REPOSITORIES at JISCMAIL.AC.UK] On Behalf Of Hilary Jones
Sent: Friday, July 15, 2016 3:43 AM
To: JISC-REPOSITORIES at JISCMAIL.AC.UK
Subject: Seeing unusually high downloads in IRStats - IRUS-UK's explanation and why this isn't affecting IRUS-UK stats

Hi everyone,

There was a discussion, via the UKCORR mailing list, about why exceptionally high download counts are being seen in IRStats this week and what might be causing them.

After some investigation we have found that the unusually high downloads are down to four IP ranges:

IP range        Organisation            Location   No. IP addresses
103.25.156.*    Microsoft Bingbot       China      128
103.36.96.*     Microsoft Corporation   China      216
111.221.28.*    Microsoft Bingbot       China      256
202.89.235.*    Microsoft Bingbot       China      80


These IPs have been systematically trawling and downloading files from many UK repositories. Their User-Agent strings do not declare them as bots; instead they masquerade as normal users.
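
Because the User-Agent gives nothing away, any local filtering has to key on the address itself. The following is a minimal sketch of such a check, assuming each wildcard range above can be treated as a full /24 network (which may be broader than the set of addresses actually observed); the function names are hypothetical and this is not the filter IRUS-UK uses.

```python
import ipaddress

# The four wildcard ranges reported above.
WILDCARD_RANGES = ["103.25.156.*", "103.36.96.*", "111.221.28.*", "202.89.235.*"]


def wildcard_to_network(pattern: str) -> ipaddress.IPv4Network:
    """Interpret 'a.b.c.*' as the /24 network a.b.c.0/24 (an assumption)."""
    return ipaddress.ip_network(pattern.replace("*", "0") + "/24")


NETWORKS = [wildcard_to_network(pattern) for pattern in WILDCARD_RANGES]


def is_reported_crawler(ip: str) -> bool:
    """Return True if the address falls inside any of the reported ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in NETWORKS)
```

For example, `is_reported_crawler("103.25.156.17")` is true, while an address in the neighbouring 103.25.157.* block is not matched.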

Happily, the IRUS-UK ingest has been filtering out these robotic downloads, so you won't see a massive spike in your IRUS-UK stats.

We hope this is of help.

Best wishes
Hilary


Hilary Jones
Services and Projects Support

0161 413 7541
Skype hilary.jones at jisc.ac.uk
Twitter @JonesHilaryJ
6th Floor Churchgate House, 56 Oxford Street, Manchester, M1  6EU

jisc.ac.uk

Jisc is a registered charity (number 1149740) and a company limited by guarantee which is registered in England under Company No. 5747339, VAT No. GB 882 5529 90. Jisc's registered office is: One Castlepark, Tower Hill, Bristol, BS2 0JA. T 0203 697 5800. jisc.ac.uk





