Millions of (poorly coded) bots relentlessly crawl the web to detect and spew junk content into any form they find. The go-to countermeasure is to force everyone to complete a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA). CAPTCHAs are those annoying user-hostile tests where you type in skewed letters or identify objects in photos. They require cultural familiarity, introduce accessibility barriers, and waste everyone’s time. Instead of using a CAPTCHA, you can detect and block many bot submissions using completely unobtrusive form validation methods.
The methods I’ll discuss in this article help identify whether a form was submitted via a standards-compliant modern web browser. They’re mini web-standards compliance tests that don’t rely on the user doing anything but using a web browser of their choice. They won’t help (much) with forms submitted via browser automation, and they’ll only delay bots written specifically to target your website.
Fortunately, the vast majority of comments are still submitted via scripted simpleton bots. They’re much faster and more economical than puppeteering a real web browser into spewing spam across the web.
This article is for developers; almost everything discussed requires modifying the client-side form and server-side validation. This article assumes some familiarity with HyperText Markup Language (HTML) and HyperText Transfer Protocol (HTTP) request headers. Without further ado, let’s get started.
Drop support for obsolete HTTP versions. All modern web browsers use version 3 or 2, with only limited use of HTTP/1.1 as a fallback. You can safely drop support for HTTP/0.9 (introduced in 1991) and HTTP/1.0 (1996). HTTP/1.1 followed the next year in 1997, and HTTP/2 didn’t appear until 2015. Block any form submissions server-side that arrive over these legacy protocols. Many bots are incredibly poorly written and rely on primitive HTTP libraries, which makes this measure remarkably effective against the simplest bots.
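As a minimal sketch, assuming a WSGI-style `environ` dictionary (adapt the lookup to whatever request object your framework exposes), the protocol check can be as simple as:

```python
# Sketch: reject form submissions made over legacy HTTP versions.
# Assumes a WSGI-style environ dict exposing SERVER_PROTOCOL.

OBSOLETE_PROTOCOLS = {"HTTP/0.9", "HTTP/1.0"}

def is_obsolete_http(environ: dict) -> bool:
    """Return True when the request was made over HTTP/0.9 or HTTP/1.0."""
    protocol = environ.get("SERVER_PROTOCOL", "")
    return protocol in OBSOLETE_PROTOCOLS
```

Reverse proxies sometimes rewrite the protocol version, so check what your server actually reports before enforcing this.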
Add a #hash to the form’s action URL. The action URL is the URL where the form submits its data. This only requires a small change to your form, and relies on weaponizing URL-Standard-unaware clients’ mishandling of URL #hashes. Bots will literally submit to the wrong URL, whereas every browser knows exactly how to handle the situation.
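A sketch of the server-side half of this trick, where the path `/comment` and the fragment `#submit` are hypothetical examples: a compliant browser strips the fragment before sending the request, while a naive client submits the literal action string.

```python
# The form would declare action="/comment#submit". A standards-compliant
# browser requests "/comment"; a fragment-unaware bot requests the
# literal "/comment#submit" and fails the check.

EXPECTED_PATH = "/comment"  # hypothetical form endpoint

def looks_like_bot(request_path: str) -> bool:
    """Flag requests that kept the fragment (or hit the wrong path)."""
    return request_path != EXPECTED_PATH
```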
Include a hidden prefilled form field. Bots are lazy and they might be bulk-submitting standard/common field names without bothering with the fields that are in your form. Include any hidden form field and check that it gets submitted to your server. There’s no need to waste resources trying to validate the rest of the form data if the data didn’t even originate from your form.
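A minimal sketch of the round-trip check; the field name `form-src` and its value are arbitrary placeholders. The matching form markup would be something like `<input type="hidden" name="form-src" value="a41f">`.

```python
# Sketch: require a hidden prefilled field to round-trip unchanged.
# The field name and value are illustrative, not prescribed.

HIDDEN_FIELD = "form-src"
EXPECTED_VALUE = "a41f"  # any static or per-session value works

def originated_from_form(submitted: dict) -> bool:
    """Reject submissions that omit or alter the hidden field."""
    return submitted.get(HIDDEN_FIELD) == EXPECTED_VALUE
```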
Verify the Host and Origin request headers. Bots frequently mess up or omit these request headers. They’re required with every form submission in modern browsers. They should match your website’s domain and scheme exactly; e.g. enforce the https:// prefix and exactly match the host and origin of your form’s action URL.
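A sketch of the exact-match check, with `example.com` standing in for your own domain:

```python
# Sketch: strict Host/Origin validation against the form's action URL.
# "example.com" is a placeholder for your site's actual host.

EXPECTED_HOST = "example.com"
EXPECTED_ORIGIN = "https://example.com"

def headers_match(headers: dict) -> bool:
    """Require an exact scheme + host match on both request headers."""
    return (headers.get("Host") == EXPECTED_HOST
            and headers.get("Origin") == EXPECTED_ORIGIN)
```

Note the comparison is deliberately exact: `http://example.com` or a subdomain mismatch fails, which is the point.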
Swap the name attributes of the name and email fields. Bots fill in fields based on their name attributes, while humans follow the visible labels, so only bots will fill in these fields incorrectly; just swap the fields in your form and in your server-side validation. This approach is more effective than trying to obscure the field names, because you can easily catch and block the malicious submissions.
To minimize the impact on users, set the appropriate type attributes for each field. These attributes will aid input on software keyboards and enable auto-complete and other functionality your visitors will expect. It makes it easier for bots to detect the misdirection, but they can already work out the correct order by string-matching your form’s human-readable labels.
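A sketch of the server-side check for swapped fields. In this hypothetical form, the visible “Name” input is submitted as `name="email"` and the “Email” input as `name="name"`, so a correct human submission arrives reversed; the email regex is a deliberately loose illustration, not a full validator.

```python
import re

# Loose illustrative pattern: something@something.tld
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def swapped_fields_ok(submitted: dict) -> bool:
    """With swapped name attributes, the field named "name" should hold
    the email address and the field named "email" should not."""
    holds_email = bool(EMAIL_RE.match(submitted.get("name", "")))
    holds_name = not EMAIL_RE.match(submitted.get("email", ""))
    return holds_email and holds_name
```

A bot that dutifully puts an email address into the field named `email` fails the check.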
Verify the POST/Redirect/GET (PRG) chain. PRG is a common pattern where a POST request is immediately redirected to a GET request; it helps prevent duplicate form submissions. All browsers automatically follow redirects unless the user immediately closes the page or loses internet connectivity. You can include a token as a query parameter in the redirect URL and verify that the token is sent back to your server.
You probably shouldn’t make blocking decisions based on this alone. Expect the occasional false positive, as described above. However, if the token isn’t returned within a couple of seconds, then it’s a strong indicator the submission was performed by a bot.
Those were my simple tips for detecting forms submitted by bots rather than browsers. Here’s a bonus tip that requires a little more work to implement than the others.
Block ancient versions of common browsers. Bots often mimic the User-Agent of a common browser, but the version numbers used in the bots rarely change. Over time they drift further and further behind until a point (maybe two-year-old versions) where you can safely block them without inconveniencing legitimate users. GitHub just published numbers showing that over 92% of its visitors use the current browser version or are no more than 5 versions (roughly 20 weeks) behind.
I strongly discourage you from blocking or discriminating against unknown or uncommon User-Agent request headers. The web is weird, and we as developers shouldn’t discourage that. However, you can take a sneak peek at the User-Agent header and identify the version numbers reported for the most common browsers. Armed with knowledge of the most recent versions of popular browsers, you can block significantly outdated ones.
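The version check could be sketched like this; the version floors below are made-up placeholders that you would refresh from real release data, and note that real User-Agent parsing has more edge cases (Edge and Opera also report a `Chrome/` token, for example).

```python
import re

# Hypothetical minimum major versions; refresh these periodically.
MINIMUM_MAJOR = {"Chrome": 120, "Firefox": 115}

def is_ancient_browser(user_agent: str) -> bool:
    """True only for recognized browsers far below the version floor.
    Unknown or uncommon browsers are never blocked."""
    for browser, floor in MINIMUM_MAJOR.items():
        match = re.search(rf"{browser}/(\d+)", user_agent)
        if match and int(match.group(1)) < floor:
            return True
    return False
```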
Do you know of any other unobtrusive bot detection methods? Share your ideas in the comments!