[EP-tech] Filtering the access log table

Matthew Brady Matthew.Brady at usq.edu.au
Thu Nov 21 04:50:57 GMT 2013


Hi All,

I was looking for a tool to bring our 3.2 access data in line with the filtering that 3.3 applies, and came across an email from Tim Brody about a tool he wrote to do just that:
<filter_access> - apply 3.3 filters to 3.2 data --> https://github.com/eprints/eprints/blob/68c666e9d32d2c3db864a0e345f2ee14608258a1/tools/filter_access  


The dry run for both methods reported how many records would be removed:

[eprints@eprints30310 eprints3]$ /opt/eprints3/bin/filter_access --dry-run --verbose <repoid> user_agent
removed 3458792 of 16060930 records

[eprints@eprints30310 eprints3]$ /opt/eprints3/bin/filter_access --dry-run --verbose <repoid> repeated
removed 1704689 of 16060930 records


The real run was somewhat more enthusiastic :) and removed every record. Thankfully it's been backed up, just in case.

[eprints@eprints30310 eprints3]$ time /opt/eprints3/bin/filter_access --verbose <repoid> user_agent
Removed 3458792 of 16060930

real    159m49.981s
user    100m33.096s
sys     6m2.058s

An SQL query confirms it:

mysql> select count(*) from access;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.00 sec)
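
For comparison, a quick bit of arithmetic (in Python, purely as a sanity check) on what the table should have held if the real run had removed only what it reported. The figures are the ones printed by the tool above; the variable names are just for illustration.

```python
# Figures copied from the filter_access output above.
total_records = 16060930
reported_removed = 3458792

# Rows that should remain if the delete matched the report.
expected_remaining = total_records - reported_removed
removed_share = reported_removed / total_records

print(expected_remaining)        # 12602138
print(f"{removed_share:.1%}")    # share of the table the report says was removed
```

So count(*) would be expected to come back at roughly 12.6 million rather than 0.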


Has anyone else encountered this before? Or can anyone point me to where/how to diagnose the incorrect logic?

There were a few entries in the access table with a NULL user agent, which caused the following warning for each robot in the list, for each NULL entry:

Use of uninitialized value in pattern match (m//) at /opt/eprints3/bin/filter_access line 116.

But I wouldn't have expected that to break things so badly that every record would be deleted...
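
To illustrate the kind of guard that would avoid matching against an undefined user agent, here is a minimal sketch in Python (not the tool's own Perl). The pattern list and the function name is_robot are hypothetical stand-ins, not taken from filter_access or LogHandler.pm, and treating a NULL user agent as "not a robot" is an assumption; whether such rows should be kept or dropped is a policy choice.

```python
import re

# Hypothetical robot patterns for illustration only; the real list
# lives in LogHandler.pm and is not reproduced here.
ROBOT_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"bingbot", r"yandex", r"googlebot")
]

def is_robot(user_agent):
    """Return True if user_agent matches any robot pattern.

    A NULL (None) or empty user agent is treated as "not a robot".
    That is an assumption made for this sketch.
    """
    if not user_agent:
        # Skip the pattern match entirely for missing values.
        return False
    return any(p.search(user_agent) for p in ROBOT_PATTERNS)
```

With a guard like this, a NULL entry never reaches the pattern match, so it produces no warning and is neither matched nor removed by the robots filter.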

Any help would be appreciated.

Thanks

Matt



-----Original Message-----
From: eprints-tech-bounces at ecs.soton.ac.uk [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Tim Brody
Sent: Saturday, 16 February 2013 12:01 AM
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] Re: RFC access log table

On Fri, 15 Feb 2013 10:30:24 +0000, "Alan.Stiles" <Alan.Stiles at open.ac.uk>
wrote:
> Hi Tim,
> 
> Having a quick look through the access table, it might also be nice if 
> there was the option to include / exclude a list of known robots and 
> spiders from the csv dumps, and possibly just to strip them from the 
> access table outside of the dumps, keeping it to a more manageable 
> size without losing 'relevant' information - Bing and Yandex appear to 
> be among our worst offenders.

The robots list we use is from Project COUNTER, but hasn't been updated since Jan 2011. You can see it here:
https://github.com/eprints/eprints/blob/access_log/perl_lib/EPrints/Apache/LogHandler.pm#L253

The priority for COUNTER appears to be consistency over (necessarily) accuracy.

I've created two tools, working on this branch (names may change ...):
https://github.com/eprints/eprints/commits/access_log

dump_access
 - write access log entries to CSV files "access_YYYYMM.csv"
 - remove written entries from the database
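
As a rough sketch of the write step only (not the removal from the database), grouping entries by month into "access_YYYYMM.csv" files could look like the following in Python. The record layout (dicts with a 'datestamp' datetime) and the function name dump_by_month are assumptions for illustration; this is not how dump_access itself is implemented.

```python
import csv
import os
from collections import defaultdict
from datetime import datetime

def dump_by_month(records, out_dir):
    """Write records to access_YYYYMM.csv files in out_dir.

    records: iterable of dicts, each with a 'datestamp' datetime plus
    any other columns (a hypothetical layout for this sketch).
    Returns the list of file paths written, in month order.
    """
    by_month = defaultdict(list)
    for rec in records:
        by_month[rec["datestamp"].strftime("%Y%m")].append(rec)

    paths = []
    for month, rows in sorted(by_month.items()):
        path = os.path.join(out_dir, f"access_{month}.csv")
        # Column order is taken from the first row of each month.
        fieldnames = sorted(rows[0].keys())
        with open(path, "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths
```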

filter_access
 - re-run the robots filtering based on the LogHandler list
 - filter repeated requests based on a time-window
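
The time-window idea can be sketched roughly as below, in Python. This is only an illustration under stated assumptions: a "repeat" is taken to mean the same requester hitting the same item again within the window, the 30-second default is a placeholder, and the function name filter_repeated and the tuple layout are hypothetical. The actual rules used by filter_access may differ.

```python
from datetime import datetime, timedelta

def filter_repeated(records, window=timedelta(seconds=30)):
    """Split records into (kept, removed).

    records: iterable of (requester, item_id, timestamp) tuples,
    assumed to be sorted by timestamp.
    A record is removed when the same requester requested the same
    item less than `window` after the previous request for that pair.
    """
    last_seen = {}
    kept, removed = [], []
    for rec in records:
        requester, item_id, ts = rec
        key = (requester, item_id)
        prev = last_seen.get(key)
        if prev is not None and ts - prev < window:
            removed.append(rec)
        else:
            kept.append(rec)
        # Track the latest request, whether kept or removed, so a
        # burst of rapid repeats is collapsed to its first request.
        last_seen[key] = ts
    return kept, removed
```

Returning the removed records separately, rather than deleting in place, makes it easy to compare a dry-run count against what a real run would actually delete.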

These use a new CSV exporter I'm working on, but could use the existing CSV.
(I'm working on a publicly usable CSV export/import, which only operates on user-importable fields).

/Tim.


