James Cridland

Generating podcast thumbnail images, caching and changing them, with AWS

What I did at first

  • You ask for “The Daily”’s podcast thumbnail.
  • I go to the iTunes API to find out where the image is.
  • I link to Apple’s copy in my HTML.
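
As a rough illustration of the lookup step, here is a minimal Python sketch. It assumes Apple's public lookup endpoint (itunes.apple.com/lookup?id=...) and that podcast results carry artwork fields such as artworkUrl600; the function names are made up for this example, and the parsing is kept separate from the network call so it can be tried without hitting Apple.

```python
import json
import urllib.request
from typing import Optional

# Assumed endpoint: Apple's public iTunes lookup API.
LOOKUP_URL = "https://itunes.apple.com/lookup?id={apple_id}"

def artwork_url_from_lookup(payload: str) -> Optional[str]:
    """Return the largest artwork URL found in a lookup response, if any."""
    results = json.loads(payload).get("results", [])
    if not results:
        return None
    first = results[0]
    # Prefer the biggest image; fall back to smaller ones if it is missing.
    for key in ("artworkUrl600", "artworkUrl100", "artworkUrl60", "artworkUrl30"):
        if first.get(key):
            return first[key]
    return None

def fetch_artwork_url(apple_id: int) -> Optional[str]:
    """Ask the lookup API about one podcast and extract its artwork URL."""
    url = LOOKUP_URL.format(apple_id=apple_id)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return artwork_url_from_lookup(resp.read().decode("utf-8"))
```

The URL that comes back is the one that would end up in the page's HTML, which is exactly what sends every visitor's browser to Apple.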

This is stealing Tim Apple’s bandwidth, which is a bit naughty. It also gives Apple all the IP/user-agent details of my visitors, which is naughtier. So, I changed to:

Using CloudFront to cache

  • You ask for “The Daily”’s podcast thumbnail.
  • I go to the iTunes API to find out where the image is.
  • I automatically download and dynamically resize a copy, and hope CloudFront caches the response for a long time.

But CloudFront has many hundreds of caches, each of which can ask my server for the image separately, and you’re not guaranteed that they’ll keep it anyway. So a significant amount of my server time was spent getting images and resizing them. This is bad, too, though much better for privacy. So I changed to:

Using S3 to cache

  • You ask for “The Daily”’s podcast thumbnail.
  • I tell you it’s at example.com/images/123456.jpg (where 123456 is the Apple ID).
  • This is an address on my CloudFront distribution, pointing to my S3 bucket.
  • If the file isn’t there (a 404), S3 will automatically forward the request to my server.
  • Armed with the Apple ID, my server goes to the iTunes API to find out where the image is.
  • It both dynamically produces a resized version for the browser and uploads it to the correct location in my S3 bucket.
  • Next time you ask, my CloudFront distribution will automatically grab it out of the S3 bucket.
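
The miss path in the steps above can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the actual server code: a plain dict stands in for the S3 bucket, and handle_miss and make_thumbnail are hypothetical names, with make_thumbnail standing in for the look-up, download and resize work.

```python
from typing import Callable, Dict

def handle_miss(apple_id: str,
                bucket: Dict[str, bytes],
                make_thumbnail: Callable[[str], bytes]) -> bytes:
    """Handle a request that reached the server because the bucket had no file."""
    key = f"images/{apple_id}.jpg"
    if key in bucket:
        # Another request may have filled the gap in the meantime.
        return bucket[key]
    image = make_thumbnail(apple_id)  # hypothetical: look up, download, resize
    bucket[key] = image               # "upload" so later requests skip the server
    return image                      # and answer the current request directly
```

The point of the design is that the expensive step runs once per image; every later request is answered from storage without touching the server.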

This is a much better plan. It means that my little server only ever produces an image once, even if there are hundreds of CloudFront caches. And even if a cached copy expires, it doesn’t matter, since CloudFront will just retrieve the file from S3.

In fact, this is too good a plan. Because it means that the artwork never changes. And podcasting artwork does change: rather more often than you’d think.

This plan also relies on the Apple iTunes API, and on Apple already having spotted that there’s a new image. And it starts from an image that Apple has already resized and re-encoded. None of this is good.

So I now do this:

The ultimate podcast thumbnail cache

  • “The Daily”’s thumbnail is now referenced from my pages as example.com/images/123456-a087d5dc.jpg
  • 123456 is the Apple ID, as above.
  • a087d5dc is a CRC32 hash of the real image location, taken from the RSS feed. I cache everything else about the podcast for a maximum of seven days, including the real location of the image.
  • Using the same setup as above, the image is hosted on S3.
  • If the file isn’t there, my server will dynamically make it (looking up the image location in my podcast database), and then upload a copy there.
  • If the image location changes in the RSS feed, so will the image name, since the hash of the location will also change.
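
The naming scheme above can be sketched in a few lines of Python. Treat it as an illustration only: the function name and the feed URLs are made up, and it uses zlib's standard CRC-32, which may or may not be the variant that produced a087d5dc.

```python
import zlib

def image_name(apple_id: int, image_url: str) -> str:
    """Build a file name from the Apple ID plus a CRC-32 of the image location."""
    # Mask to 32 bits so the value always formats as eight hex digits.
    checksum = zlib.crc32(image_url.encode("utf-8")) & 0xFFFFFFFF
    return f"{apple_id}-{checksum:08x}.jpg"

# Hypothetical feed URLs: a new image location gives a new file name,
# so the old cached file is simply never requested again.
old_name = image_name(123456, "https://example.com/art/cover-v1.jpg")
new_name = image_name(123456, "https://example.com/art/cover-v2.jpg")
```

A CRC is not a cryptographic hash, but it doesn't need to be here: it only has to change when the location changes, and stay short enough to sit in a file name.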

This means no reliance on Apple’s iTunes API, and images that come directly from the source rather than from a copy Apple has already resized.

The only drawback is that I have to make a database call to build the podcast image URL. But on a brighter note, most of the time, I’m making that database call already. And on the occasions I wasn’t, the fact I now have to make a database call means I can also programmatically get the podcast’s name for the image’s alt attribute.
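
To show how the podcast's name might end up in the markup once it has come out of the database, here is a small Python sketch. The function name and the example.com/images/ path layout are assumptions for illustration; html.escape is the standard-library call for making text safe inside an attribute.

```python
from html import escape

def thumbnail_tag(image_name: str, podcast_name: str) -> str:
    """Build an <img> tag whose alt text is the podcast's name."""
    src = f"https://example.com/images/{image_name}"
    # Escape both values so quotes or ampersands in a title can't break the tag.
    return (f'<img src="{escape(src, quote=True)}" '
            f'alt="{escape(podcast_name, quote=True)}">')
```

Escaping matters more than it might seem, because podcast titles routinely contain ampersands and quotation marks.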

All in all, I’m not quite sure why I’m telling you, but I’m quite pleased about it.