Did you know that there are standard language codes for European English and World/International English? I certainly didn’t! The language codes can be used for webpages and some software stacks including Windows 10 and Android. These implementations include support for date and number formatting information.
The language code for World English is “en-001” and European English is “en-150”. The rest of this article tries to explain how that came to be and what the consequences of using these language codes in the real world might be.
Authors use language codes to indicate the language of a document or other resource. Users use them to tell software which languages they understand, and software uses them to serve each user a language variant they can read. But where do these codes come from, and what do they mean?
The origins of ‘language codes’
In computer communication and engineering, you’ll often come across standardized two- and three-letter (or numeric) codes for countries, and a similar set of codes for languages. These two sets of codes are used together to indicate the locale and formatting preferences of a document author or of a user.
The country codes used in computer systems today mostly originate from an international standard known as ISO 3166 Country Codes, published by the International Organization for Standardization (ISO); the language codes come from its sibling standard, ISO 639 Language Codes. The numeric country codes in ISO 3166 are based on the older UN M.49 Standard Country or Area Codes for Statistical Use from the United Nations Statistics Division.
For example, the language tag “en-AU” combines the language code for English (en) with the country code for Australia (AU) to form the locale Australian English. You can combine any language with any country, but software is unlikely to recognize and support uncommon combinations.
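The language-plus-region structure above can be sketched as a simple pattern check. This is a deliberately simplified illustration: real BCP 47 tags also allow script subtags, variants, and extensions, which this sketch ignores.

```python
import re

# Simplified sketch of a "language-region" tag: a 2-3 letter language
# code, optionally followed by a 2-letter country code (ISO 3166) or a
# 3-digit region code (UN M.49). Not a full BCP 47 validator.
TAG_PATTERN = re.compile(r"^[a-zA-Z]{2,3}(-([a-zA-Z]{2}|[0-9]{3}))?$")

def is_simple_language_tag(tag: str) -> bool:
    """Return True if tag looks like 'en', 'en-AU', or 'en-001'."""
    return bool(TAG_PATTERN.match(tag))

print(is_simple_language_tag("en-AU"))    # language + ISO 3166 country
print(is_simple_language_tag("en-001"))   # language + UN M.49 region
print(is_simple_language_tag("english"))  # not a valid code
```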
ISO 3166 doesn’t include any codes for geographical regions that aren’t nations. The continents and the planet itself thus don’t have standardized codes in ISO 3166. If we go back to the source material, UN M.49, we find the specially designated code “001” for the geographical region “World”, plus codes for other large geographical regions such as continents. ISO 3166 has left these region codes unassigned, avoiding any collisions with M.49. I’ll get back to these non-nation region codes later.
Where do the language codes used in software come from?
The Unicode Consortium has combined language, country, and region codes with the officially recognized locale and formatting rules for numbers, dates, and more into the Common Locale Data Repository (CLDR). The locale data in the CLDR is sourced from many current internet and international standards. This data repository is in turn consumed by many leading software and operating system vendors, including Adobe, Apple, Google, IBM, and Microsoft.
Data from the CLDR repository flows directly into the popular International Components for Unicode (ICU) library, and is also the source data for the locale library in the GNU C Library (glibc). ICU ships with web browsers like Safari, Firefox, and Chrome, and is also found in operating systems like macOS and iOS. glibc is found in “everything Linux”.
ICU was originally developed by IBM, and development is now governed by the international Unicode Consortium. The project’s mission aligns closely with the consortium’s other work on providing universal software internationalization.
Locale data found in many popular language- and formatting-processing libraries also originates from the CLDR dataset. This includes Babel for Python and the globalization support in Microsoft .NET.
Notably, continental and other non-nation geographical region variants of any language, including English, haven’t been assigned Windows Language Code Identifiers (MS-LCID). These locales are thus not recognized by Windows or most Microsoft products, which means Windows can’t provide locale and formatting information for them.
Update: Windows 10 “Anniversary Update” lets you choose English (Europe), English (World), and several other broader-region locales!
The end of the U.S. locale’s hegemony in CLDR
To keep the CLDR dataset manageable, many locales inherit data from a similar locale that makes comparable formatting choices. Most English locales used to inherit from English (United States), but CLDR version 24 changed this so that English locales inherit from either English (World) or English (Europe).
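The inheritance described above can be pictured as walking a chain of parent locales. The mapping below is an illustrative excerpt, not the full CLDR parentLocales data; consult CLDR itself for the authoritative chains.

```python
# Illustrative excerpt of CLDR-style parent-locale relationships.
PARENT = {
    "en-150": "en-001",  # European English inherits from World English
    "en-001": "en",      # World English inherits from the en root
    "en-GB": "en-001",   # British English also inherits from World English
}

def inheritance_chain(locale: str) -> list[str]:
    """Walk a locale up to its root, e.g. en-150 -> en-001 -> en."""
    chain = [locale]
    while locale in PARENT:
        locale = PARENT[locale]
        chain.append(locale)
    return chain

print(inheritance_chain("en-150"))  # ['en-150', 'en-001', 'en']
```

Any formatting rule a locale doesn’t define itself is looked up along this chain, which is why making en-001 and en-150 explicit data points mattered.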
This change significantly reduced the U.S. technological and cultural hegemony over the CLDR dataset. I believe this was a good decision as U.S. companies are often overly focused on the need of the U.S. market at the expense of the international market.
This change in how the CLDR is organized also explicitly added two new language tags to the CLDR: en-001 (World English) and en-150 (European English).
It has been technically and conceptually possible to construct these locales from CLDR language and region tags since the beginning of that project. However, their inclusion as explicit data points that other locales inherit rules from means that CLDR-consuming software now knows these two locales as equals to other world languages.
Through CLDR’s efforts to reduce dependency on the en-US locale, they inadvertently also introduced en-001 and en-150 as meaningful options in thousands of software projects.
Although many programs and libraries will recognize locales like en-021 for English (Northern America) and en-053 for English (Australia and New Zealand), they won’t provide locale and formatting data for them.
Web browser and software support
Firefox, Google Chrome, and Safari are built using ICU libraries. Recent versions of these browsers, and browsers built upon them, should have full support for at least World English as well as European English. By full support, I mean they can be configured to send these language tags in their Accept-Language request headers.
Older and otherwise unsupported web browsers will still get a partial match on the en bit of these language tags. They will match “en” as English even if they can’t recognize “en-150” as European English.
The default locale for en is up to the implementation, but you’ll usually find en-US. I expect en-001 will take the place of en-US in at least some implementations in the not so distant future. The algorithm for partial matches is specified in section 3.4 of RFC 4647.
The algorithm that allows for partial matches is also the saving grace for any users, content authors, and programmers who either produce or consume webpages in European or World English. If a web browser or server doesn’t understand or provide content in the preferred language, the fallback language “English (unspecified)” or even a different regional variant like English (Australia) will be returned instead, depending on the exact server configuration.
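The fallback behavior above can be sketched with the “Lookup” scheme from RFC 4647 section 3.4: strip subtags from the right of the requested tag until something the server supports remains. This is a simplified sketch; the full algorithm also handles wildcards and single-letter subtags.

```python
def lookup(requested: str, available: set[str], default: str = "en") -> str:
    """Simplified RFC 4647 section 3.4 'Lookup': progressively strip
    subtags from the right until the remaining tag is available."""
    tag = requested
    while tag:
        if tag in available:
            return tag
        # Drop the last subtag, e.g. "en-150" -> "en".
        tag = tag.rpartition("-")[0]
    return default

served = {"en", "en-US", "en-AU", "de"}
print(lookup("en-150", served))  # falls back to plain "en"
print(lookup("en-AU", served))   # exact match, "en-AU"
```

A server without an en-150 variant thus still serves something readable rather than failing outright.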
Older versions of Microsoft Internet Information Services (IIS) don’t implement the partial-match algorithm correctly, and can return unexpected results if the web browser is configured to accept more than four different language tags. I once could have produced a long list of other broken implementations here, but I no longer have any data on this subject.
It should be pretty safe to use en-001 and en-150 in documents on the web. The ICU website lists Google Search as one of its users, which should be reassuring to most skeptics worried about the potential impact of changing their language tags on their search rankings.
What is International and World English anyway?
I left this to the end because I don’t know how to answer it. There isn’t an authoritative right answer to the question. What is any language? Ask a hundred people and you’ll get a hundred different answers.
Linguistically speaking, I haven’t got the faintest idea of what International English means. Neither language is formally defined anywhere. But what language is? Language is ever changing based on its use, and the concept of “International English” can mean different things to different people.
My own personal definition goes like this: English as used internationally, where the speaker or author knows they’re addressing an international audience and takes measures to avoid regional or complicated vocabulary, so as to be better understood by people with different levels of comprehension of English. You could compare my definition to “Simplified English”.
Technically speaking, “English (World)”, or en-001, as defined in CLDR is almost equivalent to “English (United States)”, except for some locale and formatting differences including these:
- Uses dd MM YYYY instead of MM dd YYYY date formatting.
- First day of the week is Monday instead of Sunday.
- Currency symbol is US$ instead of $.
“English (Europe)”, or en-150, as defined in CLDR is equivalent to “English (World)” with the following exceptions:
- Uses a comma instead of a period for decimal separator.
- Uses a 24-hour instead of 12-hour time format.
- Currency symbol is ¤ instead of US$.
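The differences listed above can be demonstrated with a small sketch. The patterns below are hand-written to mirror the lists in this article, not actual CLDR data; a real application should use ICU or a CLDR-backed library instead.

```python
from datetime import date

# Illustrative, hand-written rules mirroring the differences listed
# above -- NOT real CLDR data.
FORMATS = {
    "en-US":  {"date": "%m %d %Y", "decimal": "."},
    "en-001": {"date": "%d %m %Y", "decimal": "."},
    "en-150": {"date": "%d %m %Y", "decimal": ","},
}

def format_sample(tag: str, d: date, n: float) -> tuple[str, str]:
    """Format a date and a number according to the sketched rules."""
    rules = FORMATS[tag]
    number = f"{n:.2f}".replace(".", rules["decimal"])
    return d.strftime(rules["date"]), number

print(format_sample("en-US", date(2016, 8, 2), 3.14))   # ('08 02 2016', '3.14')
print(format_sample("en-150", date(2016, 8, 2), 3.14))  # ('02 08 2016', '3,14')
```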
For a more complete understanding of what formatting and locale data is included with a given locale, please refer to the CLDR.
Both World and European English are well suited for users who want one set of formatting conventions or the other, but who don’t feel they have any connection to a nation state. Netizens of the interwebs might feel compelled to prefer World English, or maybe European English if they want sensible formatting that follows international standards.