Semantic markup improves the quality of machine-translated texts

Web browsers like Google Chrome, Yandex.Browser, and Naver Whale, as well as leading search engines including Bing, Yandex, Google, Naver, and Baidu, can machine-translate any webpage in seconds. This enables anyone who understands a supported language to access documents written in any other supported language.

I noticed, completely by accident, that some of my most-read articles on this blog didn’t translate well in Microsoft Translator. Many of the technical details (such as code blocks, file paths and extensions, and function names) got mangled in with the rest of the text and caused it to lose all meaning. This isn’t really surprising considering that classic UNIX programs have names like find and at, and function names are often just run-together English phrases. This got me thinking about how you can optimize texts for machine translation.

An understandable high-quality machine-translation requires rock-solid grammar and spelling in the source material. I found that it can also benefit from rich semantic data about the text.

I’ve marked up some of my writing with text-level semantic HTML elements like abbr (abbreviations), code (computer instructions, including file names and program functions), kbd (user input), samp (program output), and var (variables). However, I haven’t been consistent, and I haven’t always used these elements properly; in particular, I’ve mixed up code, kbd, and samp.
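As a rough illustration of how these elements divide up the work, here is a short sketch. The command, file name, and program output are invented for the example; only the element names come from the HTML standard.

```html
<!-- Hypothetical instruction text; only the choice of elements matters. -->
<p>
  To list a file, type <kbd>ls -l <var>filename</var></kbd> and press <kbd>Enter</kbd>.
  The <code>ls</code> program replies with a line such as
  <samp>-rw-r--r-- 1 user user 120 Jan 1 12:00 notes.txt</samp>.
  <abbr title="Portable Operating System Interface">POSIX</abbr> describes the output format.
</p>
```

The rule of thumb in this sketch: kbd wraps what the reader types, samp wraps what the computer prints back, code wraps names of programs and functions mentioned in running text, and var marks a placeholder the reader is expected to substitute.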

Using these elements correctly can have a meaningful impact on how well readers of a translation understand your texts. The most important difference is that translation services take these tags as strong hints not to translate the enclosed text. The kbd element is sometimes translated, however, though this seems limited to keyboard shortcuts. For example, <kbd>Ctrl + Del</kbd> could become <kbd>Strg + Entf</kbd> in German. (This is the expected translation.)

You can think of your code snippets, function names, file paths, URI protocols, and other technical copy as another language with different grammar rules. Programming languages are often inspired by English but use their own grammar, syntax, and style. Intermixing two languages (e.g. English and an English-inspired programming language) can seriously confuse a machine-translation tool.

Just imagine how a machine would understand the following sentence: “The UserAccounts.HomeDir.delete(recursive=true) function will delete the /home/user/ directory.” Ignoring the punctuation and lack of spaces between words, it can be read as English. However, the sentence structure is poor and it translates just as poorly. The resulting translation may lose some of the semantic punctuation, or partially or fully translate the function name and paths. The entire sentence can be rendered meaningless.
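One way to mark up the sentence above might look like the following sketch. It assumes the code element alone is enough of a hint for the translation tool, which held for the function names and file paths I tested but is not guaranteed.

```html
<!-- The function name and the path are wrapped separately so the
     surrounding English can still be translated as a normal sentence. -->
<p>
  The <code>UserAccounts.HomeDir.delete(recursive=true)</code> function
  will delete the <code>/home/user/</code> directory.
</p>
```

With the technical fragments fenced off, what remains for the translator is a plain sentence: “The … function will delete the … directory.”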

In my testing, Microsoft Translator (also known as Bing Translator) and Yandex Translate both benefit from these semantic hints when translating from English to Norwegian or Russian. I can’t judge the overall quality of the Russian translations, but I can at least verify that function names and file paths are preserved when marked up correctly. I was unable to evaluate Baidu Fanyi, which only supports translating to Chinese.

The global translate attribute, new in HTML5, can further enhance machine translations by labeling text as translatable or not. For example, you can mark up a code example as untranslatable while marking up its comments and strings to be translated. See the example below.

<code translate="no">
// <span translate="yes">Example function</span>
function example()
{
  print("<span translate="yes">Example translatable string</span>");
}
</code>

This works in Google Translate, Microsoft Translator, Yandex Translate, and Naver Papago. Baidu Fanyi doesn’t support the translate attribute. However, Baidu (like the others) does support the legacy, non-semantic class="notranslate" convention. I’d still recommend focusing on the standard translate attribute unless you specifically need backward compatibility or support for Baidu (the leader in the Chinese market).

Update (2019-10-02): The translate="no" attribute doesn’t always work when Chrome translates a document. Using it together with the class="notranslate" attribute seems to do the trick.
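A minimal sketch of combining the two, reusing the example() function name from the earlier code example. The surrounding sentence is made up for illustration.

```html
<!-- translate="no" is the standard attribute; class="notranslate" is the
     legacy convention that Baidu and (in my experience) Chrome still honor. -->
<p>
  Call <code translate="no" class="notranslate">example()</code>
  to print the greeting.
</p>
```

The two hints are redundant for tools that understand both, so doubling up should be harmless; it only adds a little markup.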

In my testing, the semantic text-level elements were taken as hints and would sometimes be translated despite the markup, such as with the keyboard shortcuts above. The translate="no" attribute was the only method that always excluded a piece of text from being translated.

Setting a translate="yes" override inside a translate="no" block is only supported by Google Translate at the time of writing. The other translation tools simply ignore it. However, Google Translate doesn’t always preserve whitespace in pre-formatted text blocks (like a <pre> element). Instead, it normalizes all whitespace to a single space character. This can harm the semantics and readability of your code examples.

I’ve seen great improvements in the machine-translated versions of my texts when using Microsoft Translator and some improvements when using Google Translate after applying these methods.

This may be an optimization that will lose any value once translation services, machine learning, and artificial intelligence mature. Until then, it’s one more thing to think about when you author technical documents, whether they’re about programming and linguistics or something else entirely.

Take-aways: detect and measure machine-translations. Mark up your code examples with HTML text-level semantic markup. Mark up non-translatable texts with translate="no" and class="notranslate".