EPrints Technical Mailing List Archive

Message: #01583



[EP-tech] Re: RFC access log table


On Fri, 15 Feb 2013 10:30:24 +0000, "Alan.Stiles" <Alan.Stiles@open.ac.uk>
wrote:
> Hi Tim,
> 
> Having a quick look through the access table, it might also be nice if
> there was the option to include / exclude a list of known robots and
> spiders from the csv dumps, and possibly just to strip them from the
> access table outside of the dumps, keeping it to a more manageable size
> without losing 'relevant' information - Bing and Yandex appear to be among
> our worst offenders.

The robots list we use is from Project COUNTER, but hasn't been updated
since Jan 2011. You can see it here:
https://github.com/eprints/eprints/blob/access_log/perl_lib/EPrints/Apache/LogHandler.pm#L253

The priority for COUNTER appears to be consistency over (necessarily)
accuracy.

I've created two tools, working on this branch (names may change ...):
https://github.com/eprints/eprints/commits/access_log

dump_access
 - write access log entries to CSV files "access_YYYYMM.csv"
 - remove written entries from the database
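
As a rough illustration of the dump step only (this is a Python sketch, not
the Perl on the branch): group rows by month, append them to the matching
"access_YYYYMM.csv", and hand back the ids that were written so the caller can
delete them. The column names and the dump_access() signature below are
placeholders chosen for the example.

```python
import csv
import os
from collections import defaultdict
from datetime import datetime

# Placeholder columns for the sketch; the real access table has more fields.
FIELDS = ["accessid", "datestamp", "requester_id", "referent_id", "service_type_id"]


def dump_access(rows, out_dir):
    """Append rows to per-month CSV files and return the ids that were written."""
    # Bucket rows by the YYYYMM of their datestamp.
    by_month = defaultdict(list)
    for row in rows:
        by_month[row["datestamp"].strftime("%Y%m")].append(row)

    written = []
    for month, month_rows in sorted(by_month.items()):
        path = os.path.join(out_dir, "access_%s.csv" % month)
        is_new = not os.path.exists(path)
        with open(path, "a", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
            if is_new:
                # Only the first dump into a month's file writes the header.
                writer.writeheader()
            for row in month_rows:
                writer.writerow(dict(row, datestamp=row["datestamp"].isoformat()))
        # Ids are collected only after the file has been closed cleanly.
        written.extend(r["accessid"] for r in month_rows)
    return written
```

The returned ids are the rows that are safe to remove from the database;
deleting only after a successful write keeps a failed dump from losing data.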

filter_access
 - re-run the robots filtering based on the LogHandler list
 - filter repeated requests based on a time-window
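
The two filtering passes could look roughly like this in Python (again a
sketch under assumptions, not the branch code). The robot patterns are
stand-ins for the LogHandler list, the 30-second window is an arbitrary
example value, and treating "repeated" as the same requester hitting the same
item is my assumption for the sketch.

```python
from datetime import datetime, timedelta

# Stand-in patterns; the real list is the one in LogHandler.pm.
ROBOT_PATTERNS = ["bingbot", "yandex"]


def is_robot(user_agent):
    """True if the user agent contains any known robot pattern."""
    ua = (user_agent or "").lower()
    return any(pattern in ua for pattern in ROBOT_PATTERNS)


def filter_access(rows, window=timedelta(seconds=30)):
    """Return rows with robots and repeated requests inside `window` removed."""
    last_seen = {}  # (requester, item) -> datestamp of the most recent request
    kept = []
    for row in sorted(rows, key=lambda r: r["datestamp"]):
        if is_robot(row["requester_user_agent"]):
            continue
        key = (row["requester_id"], row["referent_id"])
        previous = last_seen.get(key)
        # The window slides: every non-robot request resets the clock,
        # including requests that are themselves dropped as repeats.
        last_seen[key] = row["datestamp"]
        if previous is not None and row["datestamp"] - previous < window:
            continue
        kept.append(row)
    return kept
```

Because the robots check is a plain function over the user agent, it can be
re-run over old rows whenever the list is updated.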

These use a new CSV exporter I'm working on, but could use the existing
CSV exporter.
(I'm working on a publicly usable CSV export/import, which only operates on
user-importable fields.)

/Tim.

> -----Original Message-----
> From: Tim Brody [mailto:tdb2@ecs.soton.ac.uk] 
> Sent: 15 February 2013 09:32
> To: eprints-tech@ecs.soton.ac.uk
> Subject: [EP-tech] Re: RFC access log table
> 
> Hi,
> 
> Yes, there is nothing in the core that relies on data in access*.
> IRStats 1 & 2 use the access table to create their summary data.
> 
> It looks like the best solution is to provide a tool to periodically dump
> historic access data to files, but that it is still useful to keep
> "current" (defined by config) data in the database.
> 
> All the best,
> Tim.
> 
> On Fri, 15 Feb 2013 08:13:52 +0100, Yuri <yurj@alfa.it> wrote:
>> We've a test server which is a clone of the production server. Can I
>> empty those access tables safely to save space? :) Can I do a "delete *
>> from access" without any issue? The same for access__ordervalues_en and
>> all the languages?
>> 
>> Il 15/02/2013 03:13, Mark Gregson ha scritto:
>>> Hi Tim
>>>
>>> Because of the DB backup issues we invested some time a while ago in some
>>> scripts for archiving the access data off to monthly dumps and for
>>> restoring it (if required, say by the need to have IRStats reprocess all
>>> data). These scripts are not actually in production use because I haven't
>>> had time to test them to my satisfaction (sorry Nick!).
>>>
>>> CSV is a more accessible format than a MySQL dump, which may be a
>>> benefit.
>>>
>>> We are using IRStats for statistics which uses the access table but I
>>> guess this will be easily updated with a new parser. We also do some
>>> custom logging to the access table for reporting on outbound link clicks
>>> via IRStats.  This logging is handled via EPrints::Apache::LogHandler.
>>>
>>> Cheers
>>> Mark
>>>
>>>
>>> -----Original Message-----
>>> From: eprints-tech-bounces@ecs.soton.ac.uk
>>> [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Tim Brody
>>> Sent: Thursday, 14 February 2013 8:01 PM
>>> To: eprints-tech@ecs.soton.ac.uk
>>> Subject: [EP-tech] RFC access log table
>>>
>>> Hi All,
>>>
>>> I'm thinking about the access log table and how it can be made
>>> sustainable.
>>>
>>> What I'm suggesting is to write accesses to CSV-formatted log files, one
>>> file per month. What I don't know is whether anyone is relying on the
>>> database table for generating statistics?
>>>
>>> The problem the access log table creates is in backing-up the EPrints
>>> database.
>>>
>>> I'd appreciate any thoughts/comments.
>>>
>>> --
>>> All the best,
>>> Tim
>>>
>>> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
>>> *** Archive: http://www.eprints.org/tech.php/
>>> *** EPrints community wiki: http://wiki.eprints.org/
>> 
> 
> -- 
> All the best,
> Tim.

-- 
All the best,
Tim.