[EP-tech] Seeing unusually high downloads in IRStats
John Salter
J.Salter at leeds.ac.uk
Tue Jul 26 09:45:10 BST 2016
Hi Betsy,
As these requests do not identify themselves as robots in their User-Agent, it's not as simple as adding a new UA to a list.
The user-agent filtering is done by: EPrints::Plugin::Stats::Filter::Robots (~/lib/plugins/EPrints/Plugin/Stats/Filter/Robots.pm)
I think that you should duplicate this to a new filter:
EPrints::Plugin::Stats::Filter::IP
As the list of bad IPs might be quite dynamic, you might want to move the equivalent of the @ROBOTS list into a config variable, so that it can be updated without editing the plugin.
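To illustrate the matching logic such an IP filter would need, here is a minimal sketch. It is written in Python purely for illustration (the real filter would be a Perl plugin alongside Robots.pm), and the names BLOCKED_NETWORKS, compile_networks and is_blocked are hypothetical, not part of EPrints. It assumes each "*" range from the forwarded message can be treated as a /24 network.

```python
import ipaddress

# Hypothetical config value: the four ranges reported in the forwarded
# message, written as /24 networks (the "*" covers the last octet).
BLOCKED_NETWORKS = [
    "103.25.156.0/24",
    "103.36.96.0/24",
    "111.221.28.0/24",
    "202.89.235.0/24",
]


def compile_networks(cidrs):
    """Parse CIDR strings once so each lookup is a cheap containment test."""
    return [ipaddress.ip_network(c) for c in cidrs]


def is_blocked(requester_ip, networks):
    """Return True if the requester's IP falls in any blocked network."""
    try:
        addr = ipaddress.ip_address(requester_ip)
    except ValueError:
        # Unparseable address: leave the record alone rather than guess.
        return False
    return any(addr in net for net in networks)


NETWORKS = compile_networks(BLOCKED_NETWORKS)
```

Keeping the list as data rather than hard-coding it in the filter means a new range can be added by changing one config entry.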
As to the question about applying the new filters to the current dataset, I think you can re-process all the stats - but this may take some time on a busy/established system!
Cheers,
John
From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Coles, Elizabeth A. (Betsy)
Sent: 26 July 2016 00:45
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Seeing unusually high downloads in IRStats
Forwarding from JISC-REPOSITORIES list - we've been seeing this in California too, and our IRStats2 counts are through the roof for the last couple of weeks.
Can anyone tell me how to filter out these robots in IRStats2, and how to clean the access file so that our IRStats2 reports are not distorted by this deluge? I assume I'd want to delete all entries with a requester_id in the table below and rerun the IRStats2 setup from scratch.
Thanks,
Betsy Coles
Caltech - Digital Library Development
bcoles at caltech.edu
From: Repositories discussion list [mailto:JISC-REPOSITORIES at JISCMAIL.AC.UK] On Behalf Of Hilary Jones
Sent: Friday, July 15, 2016 3:43 AM
To: JISC-REPOSITORIES at JISCMAIL.AC.UK
Subject: Seeing unusually high downloads in IRStats - IRUS-UK's explanation and why this isn't affecting IRUS-UK stats
Hi everyone,
There was a discussion on the UKCORR mailing list about why exceptionally high download counts are being seen in IRStats this week, and what might be causing them.
After some investigation we have found that the unusually high downloads are down to four IP ranges:
IP range        Organisation           Location  No. IP addresses
103.25.156.*    Microsoft Bingbot      China     128
103.36.96.*     Microsoft Corporation  China     216
111.221.28.*    Microsoft Bingbot      China     256
202.89.235.*    Microsoft Bingbot      China     80
These IPs have been systematically trawling and downloading files from many UK repositories. Looking at their User Agent strings they do not declare themselves as bots but masquerade as normal users.
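For anyone wanting to check their own logs for similar ranges, a rough sketch of one way to do it is to group requester IPs by their first three octets and count the requests in each group. This is an assumption about how such an investigation could be done, not a description of the IRUS-UK process; the function names are hypothetical and the input is assumed to be a plain list of IPv4 address strings taken from the access records.

```python
from collections import Counter
import ipaddress


def prefix24(ip):
    """Map an IPv4 address to its /24 range, written with a trailing '*'."""
    addr = ipaddress.IPv4Address(ip)  # raises AddressValueError if not IPv4
    octets = str(addr).split(".")
    return ".".join(octets[:3]) + ".*"


def downloads_per_range(requester_ips):
    """Count requests per /24 range, skipping anything that is not IPv4."""
    counts = Counter()
    for ip in requester_ips:
        try:
            counts[prefix24(ip)] += 1
        except ipaddress.AddressValueError:
            continue
    return counts
```

Calling downloads_per_range(ips).most_common(10) then lists the ten busiest ranges, which can be compared against the table above.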
Happily, the IRUS-UK ingest has been filtering out these robotic downloads, so you won't see a massive spike in your IRUS-UK stats.
We hope this is of help.
Best wishes
Hilary
Hilary Jones
Services and Projects Support
0161 413 7541
Skype hilary.jones at jisc.ac.uk
Twitter @JonesHilaryJ
6th Floor Churchgate House, 56 Oxford Street, Manchester, M1 6EU
jisc.ac.uk