[EP-tech] Re: RFC access log table
On Fri, 15 Feb 2013 10:30:24 +0000, "Alan.Stiles" <Alan.Stiles at open.ac.uk> wrote:
> Hi Tim,
> Having a quick look through the access table, it might also be nice if
> there was the option to include / exclude a list of known robots and
> spiders from the csv dumps, and possibly just to strip them from the
> access table outside of the dumps, keeping it to a more manageable size
> without losing 'relevant' information - Bing and Yandex appear to be
> our worst offenders.
The robots list we use is from Project COUNTER, but hasn't been updated
since Jan 2011. You can see it here:
The priority for COUNTER appears to be consistency over (necessarily)
I've created two tools, working on this branch (names may change ...):
- write access log entries to CSV files "access_YYYYMM.csv"
- remove written entries from the database
- re-run the robots filtering based on the LogHandler list
- filter repeated requests based on a time-window
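As a rough illustration of the four steps listed above (not the tools on the branch), the logic could be sketched in Python as follows. Everything specific here is an assumption made for the example: the row fields (when, requester, item, agent), the two robot patterns, and the 24-hour window are hypothetical, and the real robots list would come from the LogHandler.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical patterns for illustration; the real list lives in the LogHandler.
ROBOT_PATTERNS = ("bingbot", "yandex")
FIELDS = ["when", "requester", "item", "agent"]  # assumed column names


def is_robot(user_agent):
    """Return True if the user agent contains a known robot pattern."""
    ua = user_agent.lower()
    return any(pattern in ua for pattern in ROBOT_PATTERNS)


def filter_repeats(rows, window=timedelta(hours=24)):
    """Keep a request only if the same requester has no kept request for the
    same item within `window`. Rows must already be sorted by time."""
    last_kept = {}
    kept = []
    for row in rows:
        key = (row["requester"], row["item"])
        previous = last_kept.get(key)
        if previous is None or row["when"] - previous >= window:
            kept.append(row)
            last_kept[key] = row["when"]
    return kept


def group_by_month(rows):
    """Map each 'access_YYYYMM.csv' file name to the rows for that month."""
    files = defaultdict(list)
    for row in rows:
        files[row["when"].strftime("access_%Y%m.csv")].append(row)
    return dict(files)


def to_csv(rows):
    """Render rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "when": row["when"].isoformat(sep=" ")})
    return out.getvalue()


# Made-up sample data: one repeat request and one robot request.
rows = [
    {"when": datetime(2013, 1, 31, 23, 0), "requester": "10.0.0.1",
     "item": 42, "agent": "Mozilla/5.0"},
    {"when": datetime(2013, 1, 31, 23, 5), "requester": "10.0.0.1",
     "item": 42, "agent": "Mozilla/5.0"},
    {"when": datetime(2013, 2, 1, 9, 0), "requester": "10.0.0.2",
     "item": 42, "agent": "Mozilla/5.0 (compatible; bingbot/2.0)"},
    {"when": datetime(2013, 2, 2, 9, 0), "requester": "10.0.0.3",
     "item": 7, "agent": "Mozilla/5.0"},
]

humans = [row for row in rows if not is_robot(row["agent"])]
monthly = group_by_month(filter_repeats(humans))
for name in sorted(monthly):
    print(name)
    print(to_csv(monthly[name]))
```

Removing the written entries from the database is left out of the sketch; it would only be safe to do once the CSV file for that month has been written successfully.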
These use a new CSV exporter I'm working on, but could use the existing
(I'm working on a publicly usable CSV export/import, which only operates on
> -----Original Message-----
> From: Tim Brody [mailto:tdb2 at ecs.soton.ac.uk]
> Sent: 15 February 2013 09:32
> To: eprints-tech at ecs.soton.ac.uk
> Subject: [EP-tech] Re: RFC access log table
> Yes, there is nothing in the core that relies on data in access*. The
> IRStats 1 & 2 use access to create their summary data.
> It looks like the best solution is to provide a tool to periodically dump
> historic access data to files, but that it is still useful to keep
> "current" (defined by config) data in the database.
> All the best,
> On Fri, 15 Feb 2013 08:13:52 +0100, Yuri <yurj at alfa.it> wrote:
>> We've a test server which is a clone of the production server. Can I
>> empty those access tables safely to save space? :) Can I do a "delete *
>> from access" without any issue? The same for access__ordervalues_en and
>> all the languages?
>> On 15/02/2013 03:13, Mark Gregson wrote:
>>> Hi Tim
>>> Because of the DB backup issues we invested some time a while ago in
>>> scripts for archiving the access data off to monthly dumps and for
>>> restoring it (if required, say by the need to have IRStats reprocess
>>> data). These scripts are not actually in production use because I
>>> haven't had time to test them to my satisfaction (sorry Nick!).
>>> CSV is a more accessible format than a MySQL dump, which may be a
>>> We are using IRStats for statistics which uses the access table but I
>>> guess this will be easily updated with a new parser. We also do some
>>> custom logging to the access table for reporting on outbound link
>>> via IRStats. This logging is handled via EPrints::Apache::LogHandler.
>>> -----Original Message-----
>>> From: eprints-tech-bounces at ecs.soton.ac.uk
>>> [mailto:eprints-tech-bounces at ecs.soton.ac.uk] On Behalf Of Tim Brody
>>> Sent: Thursday, 14 February 2013 8:01 PM
>>> To: eprints-tech at ecs.soton.ac.uk
>>> Subject: [EP-tech] RFC access log table
>>> Hi All,
>>> I'm thinking about the access log table and how it can be made
>>> What I'm suggesting is to write accesses to CSV-formatted log files,
>>> one file per month. What I don't know is whether anyone is relying on the
>>> database table for generating statistics?
>>> The problem the access log table creates is in backing-up the EPrints
>>> I'd appreciate any thoughts/comments.
>>> All the best,
>>> *** Options:
>>> *** Archive: http://www.eprints.org/tech.php/
>>> *** EPrints community wiki: http://wiki.eprints.org/
> All the best,
> *** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
> *** Archive: http://www.eprints.org/tech.php/
> *** EPrints community wiki: http://wiki.eprints.org/
All the best,