EPrints Technical Mailing List Archive

Message: #06425



[EP-tech] Linkcheck


Hi,

I have just written a linkcheck crawler that checks the remote URLs stored in an EPrints repository and updates the issues list for URLs that have an invalid format or return an HTTP status code other than 200.
Please let me know if there is interest in having it available; if so, I will put it on GitHub. There is still some work to do, e.g. moving some of the methods to a plugin so that they can be called from elsewhere.
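
To illustrate the kind of check described above, here is a rough sketch in Python. It is not the actual script; the function names (has_valid_format, fetch_status, check_url) and the result labels are made up for this example, and it assumes that "invalid format" means anything without an http(s) scheme and a host name. Redirects are followed by urllib, so the status reported is that of the final response.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def has_valid_format(url):
    """Return True if the URL has an http(s) scheme and a host name."""
    try:
        parts = urlsplit(url.strip())
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    # HEAD avoids downloading the body; some servers reject it,
    # so a real crawler may want to retry with GET.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError, ValueError):
        return None


def check_url(url, fetch=fetch_status):
    """Classify a URL as 'ok', 'invalid_format', 'unreachable' or 'http_<code>'."""
    if not has_valid_format(url):
        return "invalid_format"
    status = fetch(url)
    if status is None:
        return "unreachable"
    return "ok" if status == 200 else f"http_{status}"
```

The fetch parameter is only there so that the classification can be tried out without network access, e.g. check_url("https://example.org", fetch=lambda u: 404) gives "http_404". Everything other than "ok" would be a candidate for the issues list.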

Please also be aware that running a linkcheck crawler may put your editorial team under strain to fix all the dead links. Our initial run revealed that, after 10 years of running our repository, about 25% of the URLs (about 7500 in our case) no longer work.

The script also produces a report by HTTP status code that is sorted either by eprint id or by URL. Sorting by URL makes it possible to identify patterns so that URLs can be replaced or removed in batches.
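
As a rough sketch of such a report (again in Python, not the actual script), the following assumes that the results are available as (eprint id, URL, status code) tuples and that "by HTTP status code" means one group per status code. The function name build_report and the sample data are hypothetical.

```python
from collections import defaultdict


def build_report(results, sort_by="eprintid"):
    """Group (eprintid, url, status) tuples by status and sort each group."""
    if sort_by not in ("eprintid", "url"):
        raise ValueError("sort_by must be 'eprintid' or 'url'")

    # Rows inside a group are (eprintid, url) pairs.
    if sort_by == "eprintid":
        key = lambda row: (row[0], row[1])
    else:
        key = lambda row: (row[1], row[0])

    groups = defaultdict(list)
    for eprintid, url, status in results:
        groups[status].append((eprintid, url))

    # Status codes in ascending order, rows sorted within each status.
    return {status: sorted(rows, key=key)
            for status, rows in sorted(groups.items())}


# Hypothetical sample data, for illustration only.
results = [
    (12, "http://a.example/x", 404),
    (3, "http://z.example/y", 404),
    (7, "http://a.example/z", 500),
]

for status, rows in build_report(results, sort_by="url").items():
    print(status)
    for eprintid, url in rows:
        print(f"  {eprintid}\t{url}")
```

With sort_by="url", dead links on the same host end up next to each other, which is what makes patterns visible.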

Best regards,

Martin

--
Dr. Martin Brändle
Zentrale Informatik
Universität Zürich
Stampfenbachstr. 73
CH-8006 Zürich