James Cridland

Making a scalable Google Analytics alternative for pageviews on AWS

If the only thing you want to measure is pageviews, then here’s a simple(ish) scalable way to do it, if you use Amazon Web Services or similar.

Gather the data

  1. You’ll need a new subdomain, or a new domain entirely. If your website is at example.com then how about clicks.example.com as an example.
  2. Upload a transparent GIF file (the smallest legal one I found was 42 bytes) to a new Amazon S3 bucket. You could call it “c”. Set the Content-type to image/gif
  3. Point clicks.example.com at Amazon Cloudfront. Tell that Cloudfront distribution to a) get its data from your Amazon S3 bucket; b) to upload its logs to a different Amazon S3 bucket; and c) not to cache anything.
  4. In your website, you want a line or a JavaScript call near the top of your page which has a) a random number in it to reduce caching, b) what page it was called from (canonical is your friend), c) what type of event this is. This is what I use (a tag written by PHP)…
<img id="wow" style="float:left;" alt="" width="1" height="1" src="">

…an image pixel, populated by JavaScript to avoid caching.

Once you’ve set up all of this, you’ve got a separate, fully-scalable website in super-reliable Cloudfront. It’s going to write just one log-line per pageview, and it’ll put this log into an Amazon S3 bucket for you to query later.

Parse the data

Then, set up an Athena database for your Amazon S3 log-bucket. This will magically parse the data you have in your Amazon S3 log-bucket as if it’s a (slow, expensive) SQL database.

In your Amazon S3 log-bucket, you probably want to set an expiry after which everything automatically gets deleted. I’ve set mine for 180 days, because I’m not sure when I’d like to go in and check for more data.

Once a day at 2am, I run a script that runs a database call on Athena to parse the data. The SQL query is something like:

select query_string,"date" from "default"."cloudfront_logs"
WHERE "date"=current_date  interval '1' day
AND SUBSTR("query\_string", -2)='pv'

This produces a list of all pageviews for “yesterday”, which I then parse and insert into a database, along with the date. You can remove bots, and by using a hash of IP+useragent, you can work out unique views as well. All of that lets me have a small database of pageviews-per-day for each page on the site.

You could parse this data for user-agents, or even push it into things like https://goaccess.io/ (remembering that this is *just* one line per pageview). I don’t, currently: I literally just want a pagecounter at the moment, but one which is scalable (and doesn’t cause a database to be updated every page hit).

So, that’s nice.