I created my own web analytics system from scratch

I mentioned in passing a few weeks back that I’ve removed Google Analytics from Ctrl blog, and that I’d built a replacement web analytics collection and analysis system from scratch. Here are some more details on that endeavor.

I’ve never been comfortable handing information about my visitors over to Google, nor happy about the performance impact, and I was never satisfied with how poorly it tracks how much time visitors spend on pages. (A key metric in my book.)

The General Data Protection Regulation (GDPR) has reinvigorated the open-source web analytics scene, as sharing visitor data with Google suddenly carries a higher risk. Taking ownership of data collection practices and not sharing anything with third parties is less risky under the GDPR.

Analyzing web server logs to discern traffic patterns is entirely pointless on the modern web, and not without GDPR risk. You can’t glean any useful information anymore from all the noise: bots pretending to be web browsers, and browsers pre-loading pages and resources that no human ever sees. You need to collect data with JavaScript in the browser; there simply isn’t a way around that.

I watched, and even contributed to, a couple of new open-source analytics platforms before I decided to build my own analytics collection system instead.

I realized that I wanted to focus on different trade-offs and collect slightly different (and less) data than Google Analytics and even the new open-source alternatives. Rather than focusing on “visits”, I wanted to focus on getting more accurate time-on-page information. The trouble with this approach is that you have to wait until the visitor is done with a page to record the time spent on it.

To track time-on-page in the browser, I had to find a reliable way to submit data as visitors navigate away from pages (without hurting their perceived performance). I discussed the collection method and page life-cycle event handling in detail when outlining iOS and Safari’s Beacon API problems.
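
In rough strokes, the submission path looks something like the sketch below. The /collect endpoint name and the currentPayload() helper are illustrative placeholders (the payload itself is sketched further down), not my exact production code.

```js
// Sketch: queue the final measurement when the visitor leaves the page.
// "/collect" is an assumed endpoint; currentPayload() is a hypothetical
// helper that gathers the data to report (sketched later in this article).
function submitMeasurement(payload) {
  const body = JSON.stringify(payload);
  if (navigator.sendBeacon) {
    // sendBeacon queues the request so it can outlive the page itself.
    navigator.sendBeacon('/collect', body);
  } else {
    // Fallback where the Beacon API is missing or blocked; keepalive
    // serves the same purpose of letting the request outlive the page.
    fetch('/collect', { method: 'POST', body, keepalive: true });
  }
}

// pagehide fires more reliably across browsers than unload.
window.addEventListener('pagehide', () => submitMeasurement(currentPayload()));
```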

The approach I took to capture the time spent on a page was to record the document.visibilityState (hidden or visible) when the document loads, and listen for the visibilitychange event to track how long the document is visible on screen. Other approaches, like tracking scroll depth or emitting a stream of heartbeats, would hurt performance for the visitor. Analytics isn’t important to them, so I’d better not degrade their experience to serve my own needs.
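
In code, the accounting looks something like this sketch: start a timer whenever the document becomes visible, and bank the elapsed time whenever it’s hidden again.

```js
// Sketch of the visible-time accounting. Time only accumulates while
// the document is actually visible on screen.
let visibleMs = 0;
let visibleSince =
  document.visibilityState === 'visible' ? performance.now() : null;

document.addEventListener('visibilitychange', () => {
  if (document.visibilityState === 'visible') {
    visibleSince = performance.now();
  } else if (visibleSince !== null) {
    visibleMs += performance.now() - visibleSince;
    visibleSince = null;
  }
});

// Total visible time so far; read at submission time.
function timeOnPageMs() {
  return visibleSince === null
    ? visibleMs
    : visibleMs + (performance.now() - visibleSince);
}
```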

I added a ten-second minimum time requirement to effectively filter out bots, accidentally opened pages, unwanted tabs restored from a previous session, pages quickly flipped past while navigating through history, and other insignificant data points. Someone opening a page from Ctrl blog in a background tab and then closing it without ever looking at it isn’t something I care to track.
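
In terms of the earlier sketch, the filter is just a guard in the pagehide handler:

```js
// Refines the pagehide handler from the earlier sketch: only report
// pages that were visible for at least ten seconds.
const MIN_TIME_ON_PAGE_MS = 10 * 1000;

window.addEventListener('pagehide', () => {
  if (timeOnPageMs() >= MIN_TIME_ON_PAGE_MS) {
    submitMeasurement(currentPayload());
  }
});
```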

My new system records about 11 % more visitors than Google Analytics ever did, despite the filtering, the trouble recording visits from older versions of Safari, and a bug in the NoScript extension (bundled by default with the Tor Browser) that causes it to block the Beacon API regardless of a domain’s JavaScript execution permissions.

This difference is probably caused by Google Analytics being blocked by so many visitors, owing to growing privacy awareness and the backlash against tech companies, and against third-party web analytics services in particular. I’ve been saying for years that first-party analytics is fine; the trouble is with third-party analytics that can track you across the web.

In addition to collecting the current page address and time-on-page, I also collect the User-Agent string and the page referrer. (Most websites won’t need to collect the User-Agent, but I’ve got a way higher than normal interest in web browsers and User-Agents in particular.)
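
The resulting payload is tiny. Something along these lines (the field names are illustrative; the User-Agent needs no explicit collection in the browser since it arrives server-side as a request header on the beacon itself):

```js
// Illustrative payload shape; field names are assumptions, not the
// exact production format.
function currentPayload() {
  return {
    page: location.pathname,                      // current page address
    referrer: document.referrer || null,          // where the visitor came from
    timeOnPage: Math.round(timeOnPageMs() / 1000) // in whole seconds
  };
}
```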

I only store IP addresses for less than an hour, for rate-limiting purposes; no other personal data is retained or combined with other data. I make sure to trim, and even delete, unique data such as User-Agent strings and referring addresses submitted by very few visitors.
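
Server-side, this kind of short-lived storage can be as simple as an in-memory table that forgets addresses after an hour. A sketch in Node.js (the names and limits are illustrative, not my actual implementation):

```js
// Sketch: keep IP addresses in memory only long enough for rate
// limiting, then forget them. Limits are illustrative.
const seen = new Map(); // ip -> { count, firstSeen }
const WINDOW_MS = 60 * 60 * 1000; // one hour
const MAX_PER_WINDOW = 60;

function allowRequest(ip) {
  const now = Date.now();
  const entry = seen.get(ip);
  if (!entry || now - entry.firstSeen > WINDOW_MS) {
    seen.set(ip, { count: 1, firstSeen: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_PER_WINDOW;
}

// Periodically purge expired entries so no address lives past the hour.
setInterval(() => {
  const now = Date.now();
  for (const [ip, entry] of seen) {
    if (now - entry.firstSeen > WINDOW_MS) seen.delete(ip);
  }
}, 10 * 60 * 1000);
```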

There are no persistent cookies involved anywhere in the process. I may change this in the future to record retention rates (e.g. how many days, if any, since the last site visit), but without associating people with a unique identifier.
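
If I do, it could work something like this hypothetical sketch: store nothing but a local last-visit date, and report only the gap in days.

```js
// Hypothetical sketch of the retention idea: no identifier ever leaves
// the browser, only the number of days since the last visit.
const KEY = 'lastVisit';
const today = new Date().toISOString().slice(0, 10); // e.g. "2018-06-18"
const last = localStorage.getItem(KEY);
const daysSinceLastVisit = last
  ? Math.round((Date.parse(today) - Date.parse(last)) / 86400000)
  : null; // null on the first recorded visit
localStorage.setItem(KEY, today);
// daysSinceLastVisit could then be added to the reported payload.
```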

All in all, developing this took far less time than I’d estimated, and I’m happy with the results. I’d encourage anyone considering implementing Google Analytics on their website to first think through what data they actually need to collect, and then evaluate whether they could quickly write their own private data collection system.

At some indeterminate time in the future, I’ll consider making the system better suited to other uses and releasing it as an open-source project. I’d need to spend more time making presentable reports out of the collected data first. My current dashboard works for me, but I don’t expect anyone else would understand it at all.