How to detect machine-translated webpages

Everyone visiting your English-language-only website reads it in English, right? That’s a safe assumption to make, right? Well, no. Search engines including Baidu, Bing, Google, Naver, and Yandex include “Translate page” links next to results in foreign languages. Some web browsers like Google Chrome, Naver Browser (Whale), and Yandex Browser will auto-translate pages written in foreign languages to their users’ native languages.

As a writer, you likely only write in one language. Yet, your content isn’t exclusively consumed in the original language. Instant machine-translation has significantly lowered the web’s language barriers.

Machine-translation is an often unmeasured page-quality factor that greatly impacts your readers’ experience and comprehension. Without it, they may literally not understand a word of what you’re trying to communicate. However, machine-translations aren’t yet perfect. A poor-quality machine-translation may be to blame if a high percentage of readers abandon your pages quickly.

A poor translation may be an indication of poor grammar and spelling in the original language version. Or it may be an indication of a technical problem with your webpages. For example, your pages might not declare the proper language using language codes like <html lang="en"> at the start of the page. Or you may have an issue with improper use of text-level semantic structure.

Whatever the issue, it can be useful to detect the presence of machine-translations and measure how they affect your webpages’ performance. They can affect page-loading performance, engagement, and other measurements you consider to be important.

As I touched on earlier, there are two primary types of machine-translation services: proxy-based web tools and web-browser-based tools. The former type proxies pages through a translation service and serves them up to the reader from its own servers. These can be detected in JavaScript by reading the window.location property.

The latter type is more difficult to detect as the text is translated and swapped in the web browser. However, these services can still be detected by looking for document object modification artifacts that create a recognizable fingerprint.

No standards

Before I go too far into the weeds on this one, I’d like to take a moment to comment on the lack of standardization in this space. Machine-translation is something more and more web browsers do, but like Reader Mode, it’s an unstandardized rendering mode.

The machine-translation services work by replacing text with machine-translations. However, they don’t mark-up the translated versions of the page properly to indicate the new language, e.g. using the lang attribute. This can cause issues for browser extensions and accessibility services such as text-to-speech synthesizers.

Additionally, there should be a standard event emitted when a page is translated to tell the page that “hey, all the loaded content has been changed by a translation service”. This would make it far easier to detect these services’ handiwork. It would also enable developers to re-adjust their page layouts or display a notice to their readers indicating that they’re reading a machine-translation.

The rest of the article will cover how to detect machine-translation services despite the lack of standards. Due to the lack of standardization, the methods will require ongoing testing as the translation services change over time.

Proxy-based (server-side) machine-translations

I’ll start by getting the easy bit out of the way first: detecting proxied machine-translation services. You can do this straight away without having to wait for the document to load fully.

The following is a list of conditional tests in JavaScript for detecting popular translation services.

// SPDX-License-Identifier: CC0-1.0

const hostname = window.location.hostname;

// Google Translate
const isGoogleTranslate =
  hostname === "translate.googleusercontent.com" ||
  hostname.startsWith("translate.google.");

// Microsoft Bing Translate
const isMicrosoftTranslate = [
  "www.translatoruser-int.com",
  "www.translatetheweb.com",
  "ssl.microsofttranslator.com",
  "www.microsofttranslator.com"
].includes(hostname);

// Baidu Translate
const isBaiduTranslate = [
  "translate.baiducontent.com",
  "fanyi.baidu.com"
].includes(hostname);

// Yandex Translate
const isYandexTranslate =
  hostname === "z5h64q92x9.net" ||
  hostname.startsWith("translate.yandex.");

// Naver Papago
const isNaverPapago = hostname === "papago.naver.net";
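
If you would rather have a single call that names the service, the tests above can be folded into one helper. This is only a sketch: the function name detectTranslationProxy and the labels it returns are made up for illustration, and the hostnames are the ones listed above, which may well change over time.

```javascript
// SPDX-License-Identifier: CC0-1.0
// Sketch only: the function name and the returned labels are illustrative.

// Returns a label for the proxy translation service that matches the given
// hostname, or null when none of the known hostnames match.
function detectTranslationProxy(hostname) {
  if (hostname === "translate.googleusercontent.com" ||
      hostname.startsWith("translate.google.")) {
    return "google";
  }
  if (["www.translatoruser-int.com",
       "www.translatetheweb.com",
       "ssl.microsofttranslator.com",
       "www.microsofttranslator.com"].includes(hostname)) {
    return "microsoft";
  }
  if (["translate.baiducontent.com",
       "fanyi.baidu.com"].includes(hostname)) {
    return "baidu";
  }
  if (hostname === "z5h64q92x9.net" ||
      hostname.startsWith("translate.yandex.")) {
    return "yandex";
  }
  if (hostname === "papago.naver.net") {
    return "naver";
  }
  return null;
}

// In a browser you would call it as:
// detectTranslationProxy(window.location.hostname)
```

Taking the hostname as a parameter, rather than reading window.location inside the function, makes the helper easy to test outside a browser.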

A quick market refresher: Google and Microsoft are popular all over the world. Baidu is the market-leading provider of search and online services in China. Naver is hugely popular in South Korea, and has a strong following in Japan and the K-Pop community in the United States and around the world. Yandex controls roughly 50 % of the search market in Russia and is gaining ground in Turkey. Yandex has rapidly lost market share in Ukraine and Belarus in recent years.

Browser-based (client-side) machine-translations

The browser-based translation services are a bit trickier to detect. These tools usually wait for the document to fully load, and then either start translating by default or wait for the user to confirm that they want the document translated. They don’t emit any sort of standard event or give any other indication that the document has been translated.

If you only want to track translation-service usage for analytics purposes, then you can hold off on detecting their presence until the page is about to go away. Hook into the pagehide event (and not the unload event) and then quickly inspect the document for signs that a translation service has been involved.
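
One way to wire this up is sketched below. The helper name reportOnPageHide, its parameters, and the "/analytics" endpoint in the usage comment are placeholders of my choosing, not part of any library. The check passed in can be any of the fingerprint tests described in the paragraphs that follow.

```javascript
// Sketch only: reportOnPageHide and its parameter names are illustrative.
//   win          - the window object to listen on
//   isTranslated - function returning true when the document looks translated
//   send         - function that delivers the result to your analytics
function reportOnPageHide(win, isTranslated, send) {
  win.addEventListener("pagehide", () => {
    // Inspect the document as late as possible, so that translations
    // started long after the page loaded are still counted.
    send({ translated: isTranslated() });
  });
}

// Browser usage (the "/analytics" endpoint is a placeholder):
// reportOnPageHide(
//   window,
//   () => document.querySelector("html.translated-ltr, html.translated-rtl") !== null,
//   (data) => navigator.sendBeacon("/analytics", JSON.stringify(data))
// );
```

navigator.sendBeacon() is a reasonable fit for the send step because it is meant for small payloads that have to go out while the page is being unloaded.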

Google Chrome will insert one of two class names on the root html element of a translated document. The two classes to look for are translated-ltr (for translations to left-to-right languages) and translated-rtl (for translations to right-to-left languages).

Yandex Browser wraps all translated texts in a custom ya-tr-span element. It also injects several scripts and stylesheets into translated webpages. It’s quite noisy compared to its competitors.

The Microsoft Translator Extension appends _msttexthash attributes to all translated elements. The extension comes built into the Microsoft Edge browser, but can also be used separately in a different browser.

The Firefox Translations extension (Bergamot) appends an x-bergamot-translated attribute to all visible page elements. The attribute contains information about whether the given element was translated or not. Either way, its presence can tell you if the page has been through the translator.

You can combine these and check them all at once:

!!document.querySelector(
  "html.translated-ltr, html.translated-rtl, ya-tr-span, *[_msttexthash], *[x-bergamot-translated]"
);
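
If you also want to know which tool did the translating, you can test the fingerprints one at a time instead. The following is a sketch under the same assumptions as the combined selector above. The function name detectClientTranslator and the labels are invented for illustration.

```javascript
// Sketch only: the labels and the function name are illustrative.
// Each entry pairs a label with the fingerprint selector described above.
const TRANSLATOR_FINGERPRINTS = [
  ["google-chrome", "html.translated-ltr, html.translated-rtl"],
  ["yandex-browser", "ya-tr-span"],
  ["microsoft-translator", "*[_msttexthash]"],
  ["firefox-translations", "*[x-bergamot-translated]"]
];

// Returns the label of the first fingerprint found in the given document,
// or null when no known fingerprint is present.
function detectClientTranslator(doc) {
  for (const [label, selector] of TRANSLATOR_FINGERPRINTS) {
    if (doc.querySelector(selector) !== null) {
      return label;
    }
  }
  return null;
}

// In a browser you would call it as:
// detectClientTranslator(document)
```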

Naver Browser (Whale) doesn’t leave any fingerprints behind when translating pages with its Naver Papago translation service. As far as I can tell, you can’t detect it without analyzing the content itself and monitoring it for changes. Monitoring the content for changes might be the most universal method to detect machine-translations.

You can use a MutationObserver if you need to be notified about the use of a translation service as early as possible in the page’s lifecycle. This might negatively impact the page’s performance in complex documents.
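
A minimal sketch of that approach follows. It watches for the fingerprints listed earlier and stops observing as soon as one shows up. The function name watchForTranslation and its parameters are my own, and the third parameter only exists so that the observer can be swapped out when testing outside a browser.

```javascript
// Sketch only: watchForTranslation and its parameter names are illustrative.
const TRANSLATION_SELECTOR =
  "html.translated-ltr, html.translated-rtl, ya-tr-span, " +
  "*[_msttexthash], *[x-bergamot-translated]";

// Calls onDetected once, as soon as a translation fingerprint appears.
// Returns the observer, or null when the document is already translated.
function watchForTranslation(doc, onDetected, Observer = globalThis.MutationObserver) {
  const check = () => doc.querySelector(TRANSLATION_SELECTOR) !== null;
  if (check()) {
    onDetected();
    return null;
  }
  const observer = new Observer(() => {
    if (check()) {
      observer.disconnect(); // stop paying the observation cost
      onDetected();
    }
  });
  observer.observe(doc.documentElement, {
    // Only report changes to the attributes the fingerprints rely on.
    attributeFilter: ["class", "_msttexthash", "x-bergamot-translated"],
    childList: true, // catches inserted ya-tr-span elements
    subtree: true
  });
  return observer;
}

// In a browser you would call it as:
// watchForTranslation(document, () => console.log("page was translated"));
```

Limiting the observed attributes and disconnecting after the first hit keeps the overhead down, but the observer still runs on every node insertion until then, so measure before shipping it on heavy pages.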

Update (2022-07-10): Added details for detecting the new Firefox Translations (Bergamot) extension.