[EP-tech] Re: IRStats / Access log issue



Hi All,

   As a follow-up to the issue described in my previous message (quoted below), I've added a check to the Apache logging module on our repository so that null user agents are never stored in the access table. The snippet below goes in "sub _generic" in ~eprints/perl_lib/EPrints/Apache/LogHandler.pm and replaces an undefined or empty user agent with the string "Unknown". This should stop the IRStats update from looping endlessly while it waits for a request with a non-null user agent.

----- snip -----
        $epdata->{datestamp} = EPrints::Time::get_iso_timestamp( $r->request_time );
        $epdata->{requester_id} = $ip;
        $epdata->{referring_entity_id} = $r->headers_in->{ "Referer" };
        $epdata->{requester_user_agent} = $r->headers_in->{ "User-Agent" };

##### Replace an undefined or empty user agent with "Unknown" so that IRStats does not reject the row as invalid
        if( !defined( $epdata->{requester_user_agent} ) || $epdata->{requester_user_agent} eq '' )
        {
                $epdata->{requester_user_agent} = 'Unknown';
        }
#####

        # Sanity check referring URL (don't store non-HTTP referrals)
        if( !$epdata->{referring_entity_id} || $epdata->{referring_entity_id} !~ /^https?:/ )
        {
                $epdata->{referring_entity_id} = '';
        }

----- snip -----

Cheers,
Casey

________________________________
From: eprints-tech-bounces at ecs.soton.ac.uk [eprints-tech-bounces at ecs.soton.ac.uk] on behalf of rchilliard at mun.ca [rchilliard at mun.ca]
Sent: Wednesday, June 06, 2012 11:32 AM
To: eprints-tech at ecs.soton.ac.uk
Subject: [EP-tech] IRStats / Access log issue

Hi All,

   I think I've found an issue that may affect users of the IRStats module; it also involves the access logging components of EPrints. I noticed it after a status monitor on our repository reported an extended period of very high transaction rates on the back-end MySQL server. The problem is in a loop in the update subroutine of the IRStats Access.pm module, which migrates access counts from the EPrints access log table to the main stats table. The relevant loop segment is:

sub update
{
...
        # Do chunks of 100,000 records because we can potentially be dealing with
        # millions of records
        for(my $accessid = $highest_destination_access_id; $accessid < $highest_source_access_id;)
        {
                $session->log("Processing from $accessid to $highest_source_access_id");

##because it's the first update, do twice
                $sql = "SELECT * FROM " . $database->quote_identifier($source_table) . " WHERE " .
                        $database->quote_identifier('accessid') . " > $accessid ORDER BY " .
                        $database->quote_identifier('accessid') . " ASC LIMIT 100000";
                $query = $database->do_sql($sql);

                while (my $row = $query->fetchrow_hashref()){

                        next unless valid_accesslog_entry($row);
                        my %hit = %$row;
                        $accessid = $hit{accessid};
...

   In both loops, $accessid is only updated when the fetched row passes valid_accesslog_entry(). Most rows do, but we have seen some hits (rightly or wrongly) from sources with masked or empty user agents. These are stored in the access table with NULL in requester_user_agent, which fetchrow_hashref() returns as undef, and valid_accesslog_entry() then rejects the row. If the last record in the range to be migrated is such a record (that is, $accessid == ($highest_source_access_id - 1) when it is reached), $accessid does not advance to $highest_source_access_id, so the outer loop cannot exit until a client with a valid user agent makes a further page request. While it is stuck, the SQL query is issued over and over, hammering the back-end database.

   As a resolution, I'm looking at adding sanity checks to all access values written to the DB (./perl_lib/EPrints/Apache/LogHandler.pm:_create_access()), but I'd be interested to know whether there is a less invasive fix that could be carried forward across upgrades.
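
   To make the stall easier to see, here is a minimal, self-contained sketch of the loop shape described above. It is not the IRStats code: fetch_chunk(), migrate(), the $advance_first flag and the $max_passes cap are all hypothetical names invented for the illustration, the access table is replaced by an in-memory list, and valid_accesslog_entry() is a stand-in that only checks the user agent. It compares the original ordering (cursor updated after the validity check) with one possible alternative, updating the cursor before the check.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the IRStats validity check: the only rule
# modelled here is "requester_user_agent must be defined and non-empty".
sub valid_accesslog_entry {
    my ($row) = @_;
    return defined $row->{requester_user_agent}
        && $row->{requester_user_agent} ne '';
}

# Hypothetical in-memory stand-in for the access table query:
# returns up to $limit rows with accessid > $after, in ascending order.
sub fetch_chunk {
    my ($rows, $after, $limit) = @_;
    my @chunk = sort { $a->{accessid} <=> $b->{accessid} }
                grep { $_->{accessid} > $after } @$rows;
    splice(@chunk, $limit) if @chunk > $limit;
    return @chunk;
}

# Walk the table in chunks. With $advance_first true, the cursor moves
# past every row, valid or not; with it false, the cursor only moves
# past valid rows (the ordering described in the message above).
# $max_passes is a safety cap for this demonstration only.
sub migrate {
    my ($rows, $start, $highest, $max_passes, $advance_first) = @_;
    my $passes = 0;
    my @migrated;
    for (my $accessid = $start; $accessid < $highest; ) {
        last if $passes >= $max_passes;
        $passes++;
        for my $row (fetch_chunk($rows, $accessid, 100_000)) {
            $accessid = $row->{accessid} if $advance_first;
            next unless valid_accesslog_entry($row);
            $accessid = $row->{accessid};
            push @migrated, $row->{accessid};
        }
    }
    return ($passes, \@migrated);
}

# Three rows; the last one has a NULL (undef) user agent.
my @rows = (
    { accessid => 1, requester_user_agent => 'Mozilla/5.0' },
    { accessid => 2, requester_user_agent => 'Mozilla/5.0' },
    { accessid => 3, requester_user_agent => undef },
);

my ($fixed_passes)   = migrate(\@rows, 0, 3, 5, 1);
my ($stalled_passes) = migrate(\@rows, 0, 3, 5, 0);
print "advance first: $fixed_passes pass(es)\n";
print "original order: $stalled_passes pass(es), stopped by the cap\n";
```

   In this toy setup the original ordering keeps re-reading the trailing invalid row until the cap stops it, while advancing the cursor first finishes in a single pass. Whether the same reordering is safe in the real update subroutine depends on the code elided above, so treat it as something to test rather than a confirmed fix.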

Cheers,
Casey

________________________________
Casey Hilliard
PC Consultant,
Health Sciences Library / QE2 Systems,
Memorial University
Phone: 709-777-2387 (HSL)
Phone: 709-864-6267 (QE2)

This electronic communication is governed by the terms and conditions at http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2011.php

This electronic communication is governed by the terms and conditions at http://www.mun.ca/cc/policies/electronic_communications_disclaimer_2012.php
