EPrints Technical Mailing List Archive

Message: #00702



[EP-tech] Re: Sitemap


Hi Mark, All

   Many thanks for the script; it was a great help in getting a Google-friendly sitemap up and running.

Two supplemental notes Re: our local repository (3.3.7):

If you apply the patch component of the sitemap code, I believe it may be necessary to add a clause to the Apache URL handler to permit linking to the original-style sitemap. After applying the patch, the dynamic sitemap is defined as:

{Repository URL}/sitemap-sc.xml

However, we saw 404s at that location after applying the patch. The issue seemed to be that Apache does not, by default, forward requests for that page to the sitemap handler (which would normally generate the page on the fly). We solved the problem by adding a clause to ~eprints/perl_lib/EPrints/Apache/Rewrite.pm that catches any sitemap-like page request and passes it to the Perl module that handles sitemaps:

-- snip (Rewrite.pm) --
        # sitemap.xml (nb. only works if site is in root / of domain.)
        if( $uri =~ m! ^$urlpath/sitemap\.xml$ !x )
        {
                $r->handler( 'perl-script' );

                $r->set_handlers(PerlResponseHandler => \&EPrints::Apache::SiteMap::handler );

                return OK;
        }

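        # Added modification to handle supplementary sitemaps (sitemap*.xml, including sitemap-sc.xml).
        # [-\w]* can match nothing, so this pattern also covers plain sitemap.xml,
        # which the clause above has already handled.
        # (nb. only works if site is in root / of domain.)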
        if( $uri =~ m! ^$urlpath/sitemap[-\w]*\.xml$ !x )
        {
                $r->handler( 'perl-script' );

                $r->set_handlers(PerlResponseHandler => \&EPrints::Apache::SiteMap::handler );

                return OK;
        }
-- snip --
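
For anyone who wants to check which request paths the broader pattern would catch before touching Rewrite.pm, here is a rough sketch. It is a Python approximation of the Perl regex above, not EPrints code: the function name is_sitemap_request and the sample paths are made up for illustration, and re.VERBOSE stands in for Perl's /x flag.

```python
import re


def is_sitemap_request(uri, urlpath=""):
    """Return True if uri looks like a sitemap request under urlpath.

    Approximates the Perl pattern m! ^$urlpath/sitemap[-\\w]*\\.xml$ !x.
    urlpath is escaped here; the Perl original interpolates it as-is.
    """
    pattern = re.compile(
        r" ^ " + re.escape(urlpath) + r" /sitemap[-\w]*\.xml $ ",
        re.VERBOSE,  # ignore whitespace in the pattern, like Perl's /x
    )
    return pattern.match(uri) is not None


if __name__ == "__main__":
    # Hypothetical request paths, for illustration only.
    for path in ["/sitemap.xml", "/sitemap-sc.xml", "/sitemap/other.xml"]:
        print(path, is_sitemap_request(path))
```

Since the broader pattern also accepts plain /sitemap.xml, the two clauses in the snippet overlap; keeping both is harmless because the first one returns before the second is reached.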

Note 2, re: the older sitemap. The default EPrints robots.txt excludes access to the cgi directory of EPrints; however, the dynamic sitemap is generated at:

{base repository url}/cgi/export/repository/RDFXML/{repository name}.rdf

This could be an issue for polite crawlers unless some form of the above URL is added as an Allow rule. E.g. a robots.txt, given repository-specific values for {repository URL} and {repository name}:

-- snip --
User-agent: *
Sitemap: http://{repository URL}/sitemap.xml
Allow: /cgi/export/repository/RDFXML/{repository name}.rdf
Disallow: /cgi/
-- snip --
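
A quick way to sanity-check rules like these is Python's standard urllib.robotparser. The sketch below is only an illustration under assumed values: example.org and myrepo are placeholders for {repository URL} and {repository name}, and the helper name allowed is made up. Other crawlers may resolve Allow/Disallow conflicts differently, so treat the result as indicative rather than definitive.

```python
from urllib.robotparser import RobotFileParser

# Placeholder values: example.org and myrepo are assumptions, not real settings.
ROBOTS_TXT = """\
User-agent: *
Sitemap: http://example.org/sitemap.xml
Allow: /cgi/export/repository/RDFXML/myrepo.rdf
Disallow: /cgi/
"""


def allowed(path, robots_txt=ROBOTS_TXT, agent="ExampleBot"):
    """Return True if agent may fetch path according to robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "http://example.org" + path)


if __name__ == "__main__":
    rdf = "/cgi/export/repository/RDFXML/myrepo.rdf"
    print("RDF export:", allowed(rdf))
    print("other cgi page:", allowed("/cgi/users/home"))
    # Same file without the Allow line, to show the problem described above.
    without_allow = ROBOTS_TXT.replace("Allow: " + rdf + "\n", "")
    print("RDF export, no Allow line:", allowed(rdf, without_allow))
```

The Allow line is listed before the Disallow line on purpose: parsers that apply rules in file order will then reach the more specific rule first.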

Hopefully the above may prove useful to others working on sitemap bits and pieces.

Cheers,
Casey

________________________________________
From: eprints-tech-bounces@ecs.soton.ac.uk [eprints-tech-bounces@ecs.soton.ac.uk] on behalf of Mark Gregson [mark.gregson@qut.edu.au]
Sent: Wednesday, June 06, 2012 11:12 PM
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Re: Sitemap

To all those who asked for the sitemap script I wrote about on the list previously: I'm sorry for the delay in responding, but I've just now published the script to files.eprints.org.  It's currently in review and not publicly accessible, but when the review is complete you will be able to get it from http://files.eprints.org/774/.

Please let me know how you get on with it. Any feedback and suggestions (or patches) will be taken on board, but I can't guarantee I'll have time to do anything about them!

Cheers
Mark

Mark Gregson | Applications and Development Team Leader
Library eServices | Queensland University of Technology
Level 2 | R Block | Kelvin Grove Campus | GPO Box 2434 | Brisbane 4001
Phone: +61 7 3138 3782 | Web: http://eprints.qut.edu.au/
ABN: 83 791 724 622
CRICOS No: 00213J


-----Original Message-----
From: eprints-tech-bounces@ecs.soton.ac.uk [mailto:eprints-tech-bounces@ecs.soton.ac.uk] On Behalf Of Centro de Documentación
Sent: Tuesday, 5 June 2012 10:43 AM
To: eprints-tech@ecs.soton.ac.uk
Subject: [EP-tech] Sitemap

Hi,

Can anyone please share a sitemap file or give me some tips on how to create it?

Thanks,

Cristian
*** Options: http://mailman.ecs.soton.ac.uk/mailman/listinfo/eprints-tech
*** Archive: http://www.eprints.org/tech.php/
*** EPrints community wiki: http://wiki.eprints.org/

