James Cridland

Fixing 404 errors and link rot, while maintaining authenticity

A 404 error

A web address is for life: and, ideally, should always work.

On May 30 2017, I published the first edition of the Podnews podcast newsletter (to nobody, since I had no subscribers), and the web version lives at the end of this URL: https://podnews.net/update/update-for-castro.

Podnews is a link newsletter. Each story is a short one- or two-sentence summary, with a link to the story wherever it was published.

And it turns out that, while I believe a web address is for life, that’s not really what happens in real life.

Two links on this page are broken and don’t work:

  • One links to a story about Whooshkaa on another website, which no longer appears to exist
  • One is the casualty of a website redesign, as Amplifi Media has reworked its website

You’d assume that big websites like BBC News, The Guardian or The New York Times, which link out regularly, would have some sort of policy on how they deal with link rot, but I couldn’t find anything after a Google search. I’d be fascinated to see one.

So, here’s what I’m doing. I wonder if it’s the right thing.


Every so often (likely to be no more often than once a year), I’ll crawl old archive pages.

Using a specific user agent, the code will do a HEAD request. If the HTTP code that I get back is above 399, then it’s an error, and I assume the page no longer works.

I write everything in PHP, so I’m just using get_headers().
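
A minimal sketch of that check, not the code Podnews actually runs. The function names, the user agent string and the ten-second timeout are made up for illustration, and treating "no response at all" as broken is an assumption. get_headers() follows redirects by default, so the sketch reads the last status line it sees.

```php
<?php
// Hypothetical helper: pull the final status code out of get_headers() output.
// When redirects are followed, several "HTTP/..." status lines appear;
// the last one describes the page we finally landed on.
function final_status_code(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        if (is_string($line) && preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: HEAD request with a specific user agent.
// Returns null if nothing came back (DNS failure, timeout and so on).
function link_status(string $url): ?int
{
    $context = stream_context_create([
        'http' => [
            'method'     => 'HEAD',
            'user_agent' => 'ExampleLinkChecker/1.0 (+https://example.com/bot)', // placeholder
            'timeout'    => 10, // seconds; an arbitrary choice
        ],
    ]);
    $headers = @get_headers($url, false, $context);
    return $headers === false ? null : final_status_code($headers);
}

// Anything above 399 counts as an error; so does no response (assumption).
function is_broken(?int $status): bool
{
    return $status === null || $status > 399;
}
```

Usage would be along the lines of is_broken(link_status($url)) for each link found in an archive page.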


The Internet Archive

After thinking aloud on Mastodon, Paul Riismandel suggested that I link to the Internet Archive version of the story if the link doesn’t work. This was an excellent idea.

I discovered a really useful Wayback Machine API which can find a snapshot of the page that I’d like to link to.

A call to https://archive.org/wayback/available?url=example.com&timestamp=20060101 checks whether the page exists in the Internet Archive, and returns the closest snapshot to the date given. This is really useful, because it allows me not just to link to an archived copy of the page, but to how the page looked on the day (or close to) that I was writing about it and visited it.
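
Here is a rough sketch of that lookup. The function names are hypothetical; the JSON fields it reads (archived_snapshots, closest, available, url) are the ones the availability API returns. It assumes allow_url_fopen is enabled so that file_get_contents() can fetch a URL.

```php
<?php
// Hypothetical helper: build the availability query for a URL and a YYYYMMDD date.
function wayback_query_url(string $url, string $yyyymmdd): string
{
    return 'https://archive.org/wayback/available?' . http_build_query([
        'url'       => $url,
        'timestamp' => $yyyymmdd,
    ], '', '&');
}

// Hypothetical helper: pick the closest snapshot URL out of the JSON response,
// or null when the Internet Archive has no usable copy.
function closest_snapshot(string $json): ?string
{
    $data = json_decode($json, true);
    $closest = $data['archived_snapshots']['closest'] ?? null;
    if (!is_array($closest) || empty($closest['available'])) {
        return null;
    }
    $snapshot = $closest['url'] ?? null;
    return is_string($snapshot) && $snapshot !== '' ? $snapshot : null;
}

// Fetch and parse in one step; null on any network failure.
function find_snapshot(string $url, string $yyyymmdd): ?string
{
    $json = @file_get_contents(wayback_query_url($url, $yyyymmdd));
    return $json === false ? null : closest_snapshot($json);
}
```

Passing the publication date of the story as the timestamp is what gets a snapshot close to how the page looked at the time.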

So, now I’ve programmatically found (probably) a working version of the page I linked to. Hurray!

But what then?


Marking my work

I don’t really want to change the historical accuracy of a piece I wrote five years ago. Just popping along and changing the URL, even to the Internet Archive, is therefore probably not the right course of action, because I’d be breaking the authenticity of the story I wrote five years ago.

But I also want the link, if a researcher finds it, to work. So… what to do.

After a bit of thought, I’m not changing the link in my archive: because to do that would, I think, be the wrong thing to do. I didn’t link to the Internet Archive five years ago, after all. So I wanted a simple method of highlighting broken links and replacing them in a non-destructive way.

The answer, it seems, is… to inject an HTML comment in the text, which I do at the end of the paragraph that the link is in. That lets me programmatically add information in - and remove it - without a massively complicated database of links.

Here’s what the comment looks like in the link about Whooshkaa (with some linebreaks in it to help):

<!-- LINKCHECK|
20230722|
403|
https://www.radioinfo.com.au/news/don%E2%80%99t-have-time-read-news-listen-it-whooshkaa|
http://web.archive.org/web/20170516181006/https://www.radioinfo.com.au/news/don%E2%80%99t-have-time-read-news-listen-it-whooshkaa|
-->

The | delimiter here is used because it’s not a valid character in a URL unless it’s percent-encoded as %7C. I’m hoping it won’t turn up, therefore, in any URL I’ve linked to, because it’ll break things if so. 😬

Anyway, this HTML comment starts with LINKCHECK and then has the date it was checked, what the HTTP result was (a 403 Forbidden in this case), the original URL and then, if one exists, a link to the archive.
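
To make that format concrete, here is a sketch of writing, reading and removing such a comment. The function names and array keys are invented for illustration, and it assumes the stored comment is one line without the display linebreaks (the parser tolerates them anyway). It refuses any URL containing a |, since that would break the format.

```php
<?php
// Hypothetical helper: build the comment. Field order follows the example above:
// date checked, HTTP status, original URL, archive URL (empty if none).
function build_linkcheck(string $date, int $status, string $url, ?string $archive): string
{
    foreach ([$url, (string) $archive] as $field) {
        if (strpos($field, '|') !== false) {
            throw new InvalidArgumentException('URL contains the | delimiter');
        }
    }
    return '<!-- LINKCHECK|' . implode('|', [$date, $status, $url, (string) $archive]) . '|-->';
}

// Hypothetical helper: find every LINKCHECK comment in a chunk of HTML.
function parse_linkchecks(string $html): array
{
    $pattern = '/<!--\s*LINKCHECK\|([^|]*)\|([^|]*)\|([^|]*)\|([^|]*)\|\s*-->/';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);
    $records = [];
    foreach ($matches as $m) {
        $archive = trim($m[4]);
        $records[] = [
            'checked' => trim($m[1]),
            'status'  => (int) trim($m[2]),
            'url'     => trim($m[3]),
            'archive' => $archive === '' ? null : $archive,
        ];
    }
    return $records;
}

// Hypothetical helper: remove the comments again, e.g. before a fresh crawl.
function strip_linkchecks(string $html): string
{
    return preg_replace('/<!--\s*LINKCHECK\|.*?-->/s', '', $html) ?? $html;
}
```

Because the comment carries everything needed, adding or removing it never touches the original link in the paragraph.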

This version of the article text is then saved in the database.


When it comes to grabbing the page from the database, we have two use-cases for broken links. I wonder if I’ve done these right:

Link to the archive

I’m replacing the link in the HTML with the link to the Internet Archive, and displaying a note on the story that the link now goes to the Internet Archive. I’m hoping that’s the right thing to do here. Two examples are in my first edition.
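
A sketch of that swap at display time, under some assumptions: the function name is hypothetical, links in the stored HTML are written as double-quoted, HTML-escaped href attributes, and the note itself is left to the page template. The stored text is not modified; only the copy being sent to the reader is.

```php
<?php
// Hypothetical helper: point a broken link at its Internet Archive snapshot
// in the HTML that is about to be displayed.
function link_to_archive(string $html, string $url, string $archive): string
{
    $old = 'href="' . htmlspecialchars($url, ENT_QUOTES) . '"';
    $new = 'href="' . htmlspecialchars($archive, ENT_QUOTES) . '"';
    return str_replace($old, $new, $html);
}
```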

Broken link

I don’t want to link to the 404’d page directly any more, but I do want people to see a) that this link doesn’t work any more, and b) where the link was going in case they can find it with their own research.

So, I’m putting a line through the link text, and a note saying that the link doesn’t work.

But I also want to change the URL to something that shouldn’t kick off an error but will hopefully help someone. There doesn’t appear to be any prior art for this, so the solution I’ve plumped for is to link to:

about:blank?was:https://example.com/broken-link

…so someone following the link will be given their browser’s standard blank page, but if they’re looking closely in their browser bar, they’ll see the original URL it was linking to.
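
As an illustration, here is one way the rewritten link could be produced at display time, presumably for links where no archived copy was found. The function name is hypothetical, the use of an <s> element for the strike-through is an assumption, and the regular expression only handles simple, double-quoted href attributes.

```php
<?php
// Hypothetical helper: strike through the link text and point the link at
// about:blank?was:<original URL> in the HTML that is about to be displayed.
function mark_broken(string $html, string $url): string
{
    $href = htmlspecialchars($url, ENT_QUOTES);
    $pattern = '#<a\b([^>]*?)href="' . preg_quote($href, '#') . '"([^>]*)>(.*?)</a>#s';
    return preg_replace_callback(
        $pattern,
        function (array $m) use ($href): string {
            // Keep any other attributes; swap the href and wrap the text in <s>.
            return '<a' . $m[1] . 'href="about:blank?was:' . $href . '"' . $m[2] . '>'
                . '<s>' . $m[3] . '</s></a>';
        },
        $html
    ) ?? $html;
}
```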

You’ll see an example of this link in a post I made a few days later.


Is this the right way to do this? I don’t really know, but I hope it is.

Are there any examples of link rot policies published anywhere?
