Everyone visiting your English-language-only website reads it in English, right? That’s a safe assumption to make, right? Well, no. Search engines including Baidu, Bing, Google, Naver, and Yandex include “Translate page” links next to results in foreign languages. Some web browsers like Google Chrome, Naver Browser (Whale), and Yandex Browser will auto-translate pages in foreign languages for their users.
While you may only produce content in one language, your content isn’t exclusively consumed in the original language.
Machine-translation is an often unmeasured page-quality factor that greatly impacts your readers’ experience and comprehension. Without it, they may literally not understand a word of what you’re trying to communicate. However, machine-translations aren’t yet perfect. A poor-quality machine-translation may be to blame if a high percentage of readers abandon your pages quickly.
A poor-quality translation may be an indication of poor grammar and spelling in the original-language version. Or it may be an indication of a technical problem with your webpages. For example, your pages may not declare the proper language using language codes like <html lang="en"> at the start of the page. Or you may have an issue with improper use of text-level semantic structure.
Whatever the issue, it can be useful to detect the presence of and measure how machine-translations affect your webpages’ performance.
Machine-translation services come in two types: proxy-based (server-side) and browser-based (client-side). The latter type is more difficult to detect as the text is translated and swapped in the web browser. However, these services can still be detected by looking for document object modification artifacts that create a recognizable fingerprint.
Before I get into the weeds on this one, I’d like to take a moment to comment on the lack of standardization in this space.
The machine-translation services work by replacing text with machine-translations. However, they don’t mark up the translated versions of the page properly to indicate the new language, e.g. using the lang attribute. This can cause issues for browser extensions and accessibility services such as text-to-speech synthesizers.
Additionally, there should be a standard event emitted when a page is translated to tell the page that “hey, all the loaded content has been changed by a translation service”. This would make it far easier to detect these services’ handiwork. It would also enable developers to re-adjust their page layouts or display a notice to their readers indicating that they’re reading a machine-translation.
The rest of the article will cover how to detect machine-translation services despite the lack of standards. Due to the lack of standardization, the methods will require ongoing testing as the translation services change over time.
Proxy-based (server-side) machine-translations
I’ll start by getting the easy bit out of the way first: detecting proxied machine-translation services. You can do this straight away without having to wait for the document to load fully.
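Here’s a minimal sketch of one way to do that, assuming the proxy serves your page from the translation service’s own hostname. The function name and the hostname list are illustrative assumptions, not an authoritative registry; check them against your own traffic before relying on them.

```javascript
// Hypothetical list of hostname suffixes used by translation proxies.
// These entries are assumptions for illustration only. Services change
// their proxy domains over time, so verify and extend the list yourself.
const TRANSLATION_PROXY_SUFFIXES = [
  "translate.goog",                  // assumed: Google Translate proxy
  "translate.googleusercontent.com", // assumed: older Google Translate proxy
];

// Returns true when `hostname` equals a listed suffix or is a subdomain of one.
function isTranslationProxyHost(hostname, suffixes = TRANSLATION_PROXY_SUFFIXES) {
  const host = String(hostname).toLowerCase();
  return suffixes.some(
    (suffix) => host === suffix || host.endsWith("." + suffix)
  );
}

// In a browser, this can run immediately, before the document has loaded.
if (typeof window !== "undefined" &&
    isTranslationProxyHost(window.location.hostname)) {
  console.log("This page is being served through a translation proxy");
}
```

Matching on a leading dot avoids false positives from unrelated domains that merely end with the same characters.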
A quick market refresher: Google and Microsoft are popular all over the world. Baidu is the market-leading provider of search and online services in China. Naver is hugely popular in South Korea, and has a strong following in Japan and the K-Pop community in the United States and around the world. Yandex controls roughly 50% of the search market in Russia and is gaining ground in Turkey. Yandex has rapidly lost market share in Ukraine and Belarus in recent years.
Browser-based (client-side) machine-translations
The browser-based translation services are a bit trickier to detect. These tools usually wait for the document to fully load, and then either start translating by default or wait for the user to confirm that they want the document translated. They don’t emit any sort of standard event or give any other indication that the document has been translated.
If you only want to track translation-service usage for analytics purposes, then you can hold off on detecting their presence until the reader leaves the page. Hook into a page-lifecycle event that fires at that point (and not the unload event) and then quickly inspect the document for signs that a translation service has been involved.
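As a rough sketch of that deferred approach, the helper below registers a one-time listener and only reports when the inspection finds something. I’m assuming the pagehide event here; the helper name, the inspect and report callbacks, and the analytics URL in the usage comment are all hypothetical.

```javascript
// Hypothetical helper: inspect the document once as the reader leaves.
// `target` is an event target such as `window`.
// `inspect` returns a service name (string) or null when nothing was found.
// `report` receives the service name, e.g. to send it to your analytics.
function trackTranslationOnLeave(target, inspect, report) {
  target.addEventListener(
    "pagehide", // assumption: pagehide is the event used instead of unload
    () => {
      const service = inspect();
      if (service) report(service);
    },
    { once: true } // report at most once per page load
  );
}

// Browser usage (the URL is a placeholder for your own endpoint):
// trackTranslationOnLeave(
//   window,
//   () => (document.documentElement.lang !== "en" ? "unknown" : null),
//   (service) => navigator.sendBeacon("/analytics/translation", service)
// );
```

Keeping the inspection behind a callback means the same helper works with whichever fingerprint checks you settle on.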
Google Chrome will insert one of two class names on the root html element of a translated document. The two classes to look for are translated-ltr (for translations to left-to-right languages) and translated-rtl (for translations to right-to-left languages).
Yandex Browser wraps all translated text in a custom ya-tr-span element. It also injects several scripts and stylesheets into translated webpages.
The Microsoft Translator Extension appends _msttexthash attributes to all translated elements.
You can combine these checks and run them all at once.
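The sketch below is one way to combine the three fingerprints described above. The function name and the returned labels are my own; only the class names, the ya-tr-span element, and the _msttexthash attribute come from the services’ observed behavior, and those may change without notice.

```javascript
// Returns a label for the detected translation service, or null if none
// of the known fingerprints are present. `doc` defaults to the global
// document when running in a browser.
function detectClientSideTranslation(doc = globalThis.document) {
  if (!doc || !doc.documentElement) return null;
  const rootClasses = doc.documentElement.classList;

  // Google Chrome: class name added to the root html element.
  if (rootClasses.contains("translated-ltr") ||
      rootClasses.contains("translated-rtl")) {
    return "google";
  }
  // Yandex Browser: translated text wrapped in ya-tr-span elements.
  if (doc.querySelector("ya-tr-span")) return "yandex";

  // Microsoft Translator Extension: _msttexthash attributes on elements.
  if (doc.querySelector("[_msttexthash]")) return "microsoft";

  return null;
}
```

Checking the root element’s class list first keeps the common case cheap, as the two querySelector calls may have to walk the document.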
Naver Browser (Whale) doesn’t leave any fingerprints behind when translating pages with its Naver Papago translation service. As far as I can tell, you can’t detect it without analyzing the content itself and monitoring it for changes.
You can use a MutationObserver if you need to be notified about the use of a translation service as early as possible in the page’s lifecycle.
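Here’s a minimal sketch of that idea. It assumes you already have a detection function that returns a service name or null; the helper name, its parameters, and the injectable observer class (there to make the sketch testable outside a browser) are hypothetical.

```javascript
// Watches the document for changes and calls `onDetected` the first time
// `detect(doc)` returns a service name. Returns the observer so the caller
// can disconnect it manually.
function watchForTranslation(
  doc,
  detect,
  onDetected,
  ObserverClass = globalThis.MutationObserver
) {
  const observer = new ObserverClass(() => {
    const service = detect(doc);
    if (service) {
      observer.disconnect(); // stop watching after the first detection
      onDetected(service);
    }
  });
  observer.observe(doc.documentElement, {
    attributes: true,
    // Only the attributes used as fingerprints are of interest.
    attributeFilter: ["class", "_msttexthash"],
    childList: true, // catches injected elements such as ya-tr-span
    subtree: true,
  });
  return observer;
}
```

Observing the whole subtree is not free on large pages, so disconnecting after the first detection keeps the overhead to the period before a translation happens.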