Language tags are short, standardized strings that name spoken languages.

We need standardized names, or some people will tag French documents as "French," others will use "Français," others still might use "France," and lazy people might just use "Fra" or "F." Standardized language tags avoid this confusion.

There are language tags for English (en), German (de), Korean (ko), and many other languages. Language tags can describe regional variants and dialects of languages, such as Brazilian Portuguese (pt-BR), U.S. English (en-US), and Hunan Chinese (zh-xiang). There is even a standard language tag for Klingon (i-klingon)!

The Content-Language Header

The Content-Language entity header field describes the target audience languages for the entity. If the content is intended primarily for a French audience, the Content-Language header field would contain:

Content-Language: fr

The Content-Language header isn't limited to text documents. Audio clips, movies, and applications might all be intended for a particular language audience. Any media type that is targeted to particular language audiences can have a Content-Language header. In Screenshot 16-8, the audio file is tagged for a Navajo audience.

Screenshot 16-8. Content-Language header marks a "Rain Song" audio clip for Navajo speakers

If the content is intended for multiple audiences, you can list multiple languages. As suggested in the HTTP specification, a rendition of the "Treaty of Waitangi," presented simultaneously in the original Maori and English versions, would call for:

Content-Language: mi, en

However, just because multiple languages are present within an entity does not mean that it is intended for multiple linguistic audiences. A beginner's language primer, such as "A First Lesson in Latin," which clearly is intended to be used by an English-literate audience, would properly include only "en".
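As a minimal sketch of how a recipient might read this header, the following Python splits a Content-Language value into its individual tags. The function names parse_content_language and is_intended_for are hypothetical, chosen only for illustration, and the sketch assumes the header value has already been extracted from the message.

```python
from typing import List


def parse_content_language(value: str) -> List[str]:
    """Split a Content-Language header value into its language tags."""
    # Tags are comma-separated; whitespace around each tag is ignored.
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def is_intended_for(value: str, language: str) -> bool:
    """Return True if `language` is one of the listed audience languages."""
    # Language tags are compared without regard to case.
    wanted = language.lower()
    return any(tag.lower() == wanted for tag in parse_content_language(value))


print(parse_content_language("mi, en"))  # ['mi', 'en']
```

For the Latin primer described above, is_intended_for("en", "la") would be false, because the header names only the audience language, not every language that appears in the text.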

The Accept-Language Header

Most of us know at least one language. HTTP lets us pass our language restrictions and preferences along to web servers. If the web server has multiple versions of a resource, in different languages, it can give us content in our preferred language.

Servers also can use the Accept-Language header to generate dynamic content in the language of the user or to select images or target language-appropriate merchandising promotions.

Here, a client requests Spanish content:

Accept-Language: es

You can place multiple language tags in the Accept-Language header to enumerate all supported languages and the order of preference (left to right). Here, the client prefers English but will accept Swiss German (de-CH) or other variants of German (de):

Accept-Language: en, de-CH, de
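To illustrate how a server might act on such a header, here is a rough sketch that walks the listed tags from left to right and returns the first available variant that matches. The names parse_accept_language and choose_language are hypothetical, and the matching rule (a listed tag matches an available tag that is equal to it or that extends it with further subtags) is an assumption made for this example. The sketch strips any parameters after a semicolon and does not interpret them.

```python
from typing import List, Optional


def parse_accept_language(value: str) -> List[str]:
    """Return the language tags in an Accept-Language value, in listed order."""
    tags = []
    for item in value.split(","):
        # Drop anything after ';' (parameters are not interpreted in this sketch).
        tag = item.split(";", 1)[0].strip().lower()
        if tag:
            tags.append(tag)
    return tags


def choose_language(accept_language: str, available: List[str]) -> Optional[str]:
    """Pick the first available variant matching the client's listed order."""
    for wanted in parse_accept_language(accept_language):
        for candidate in available:
            have = candidate.lower()
            # "de" matches "de" itself and more specific tags such as "de-AT".
            if have == wanted or have.startswith(wanted + "-"):
                return candidate
    return None


print(choose_language("en, de-CH, de", ["fr", "de-AT", "de-CH"]))  # de-CH
```

With only an Austrian German variant on hand, the same request would fall through to the general "de" entry and select "de-AT".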

Clients use Accept-Language and Accept-Charset to request content they can understand. We'll see how this works in more detail in Chapter 17.

Types of Language Tags

Language tags have a standardized syntax, documented in RFC 3066, "Tags for the Identification of Languages." Language tags can be used to represent:

·         General language classes (as in "es" for Spanish)

·         Country-specific languages (as in "en-GB" for English in Great Britain)

·         Dialects of languages (as in "no-bok" for Norwegian "Book Language")

·         Regional languages (as in "sgn-US-MA" for Martha's Vineyard sign language)

·         Standardized nonvariant languages (e.g., "i-navajo")

·         Nonstandard languages (e.g., "x-snowboarder-slang", which describes the unique dialect spoken by "shredders")

Subtags

Language tags have one or more parts, separated by hyphens, called subtags:

·         The first subtag is called the primary subtag. The values are standardized.

·         The second subtag is optional and follows its own naming standard.

·         Any trailing subtags are unregistered.

The primary subtag contains only letters (A-Z). Subsequent subtags can contain letters or numbers, up to eight characters in length. An example is shown in Screenshot 16-9.

Screenshot 16-9. Language tags are separated into subtags
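The subtag syntax can be checked mechanically. The sketch below splits a tag on hyphens and tests each part against the character rules just described. The helper name is_well_formed is hypothetical, and applying the eight-character limit to the primary subtag as well is an assumption based on the RFC 3066 grammar. The check is purely syntactic and says nothing about whether a tag is registered or meaningful.

```python
import re

# Primary subtag: letters only. Later subtags: letters or digits.
# Both are limited to eight characters in this sketch.
_PRIMARY = re.compile(r"[A-Za-z]{1,8}")
_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def is_well_formed(tag: str) -> bool:
    """Return True if `tag` follows the hyphen-separated subtag syntax."""
    subtags = tag.split("-")
    if not _PRIMARY.fullmatch(subtags[0]):
        return False
    return all(_SUBTAG.fullmatch(subtag) for subtag in subtags[1:])


print(is_well_formed("sgn-US-MA"))  # True
print(is_well_formed("e1-US"))      # False: digit in the primary subtag
```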

Capitalization

All tags are case-insensitive; the tags "en" and "eN" are equivalent. However, lowercase conventionally is used to represent general languages, while uppercase is used to signify particular countries. For example, "fr" means all languages classified as French, while "FR" signifies the country France.

This convention is recommended by ISO standard 3166.
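A short sketch of both points follows: comparing tags without regard to case, and rewriting a tag in the conventional capitalization. The function names same_tag and normalize_case are hypothetical. The sketch assumes that only the primary subtag and a two-letter second subtag should be changed, and it leaves every other subtag exactly as written.

```python
def same_tag(a: str, b: str) -> bool:
    """Compare two language tags case-insensitively."""
    return a.lower() == b.lower()


def normalize_case(tag: str) -> str:
    """Lowercase the language subtag and uppercase a two-letter country subtag."""
    subtags = tag.split("-")
    subtags[0] = subtags[0].lower()
    # Only a two-letter second subtag is treated as a country code here.
    if len(subtags) > 1 and len(subtags[1]) == 2 and subtags[1].isalpha():
        subtags[1] = subtags[1].upper()
    return "-".join(subtags)


print(same_tag("en", "eN"))       # True
print(normalize_case("EN-us"))    # en-US
```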

IANA Language Tag Registrations

The values of the first and second language subtags are defined by various standards documents and their maintaining organizations. The IANA (see http://www.iana.org and RFC 2860) administers the list of standard language tags, using the rules outlined in RFC 3066.

If a language tag is composed of standard country and language values, the tag doesn't have to be specially registered. Only those language tags that can't be composed out of the standard country and language values need to be registered specially with the IANA. The following sections outline the RFC 3066 standards for the first and second subtags.

At the time of writing, only 21 language tags have been explicitly registered with the IANA, including Cantonese ("zh-yue"), New Norwegian ("no-nyn"), Luxembourgish ("i-lux"), and Klingon ("i-klingon"). The hundreds of remaining spoken languages in use on the Internet have been composed from standard components.

First Subtag: Namespace

The first subtag usually is a standardized language token, chosen from the ISO 639 set of language standards. But it also can be the letter "i" to identify IANA-registered names, or "x" for private, extension names. Here are the rules:

If the first subtag has:

·         Two characters, it is a language code from the ISO 639 and 639-1 standards (see ISO standard 639, "Codes for the representation of names of languages")

·         Three characters, it is a language code listed in the ISO 639-2 standard and extensions (see ISO 639-2, "Codes for the representation of names of languages - Part 2: Alpha-3 code")

·         The letter "i," the language tag is explicitly IANA-registered

·         The letter "x," the language tag is a private, nonstandard, extension subtag
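The rules above amount to a small decision procedure, sketched here in Python. The function name classify_first_subtag and the strings it returns are hypothetical labels invented for this example. The sketch looks only at the length and value of the first subtag; it does not check the subtag against the actual ISO 639 code lists.

```python
def classify_first_subtag(tag: str) -> str:
    """Describe the namespace that the first subtag of `tag` belongs to."""
    first = tag.split("-", 1)[0].lower()
    if first == "i":
        return "IANA-registered tag"
    if first == "x":
        return "private extension tag"
    if len(first) == 2 and first.isalpha():
        return "ISO 639 two-letter language code"
    if len(first) == 3 and first.isalpha():
        return "ISO 639-2 three-letter language code"
    return "unrecognized"


for example in ("en-US", "sgn-US-MA", "i-klingon", "x-snowboarder-slang"):
    print(example, "->", classify_first_subtag(example))
```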

The ISO 639 and 639-2 names are summarized in Appendix G. A few examples are shown here in Table 16-5.

Table 16-5. Sample ISO 639 and 639-2 language codes

Language          ISO 639   ISO 639-2
Arabic            ar        ara
Chinese           zh        chi/zho
Dutch             nl        dut/nla
English           en        eng
French            fr        fra/fre
German            de        deu/ger
Greek (Modern)    el        ell/gre
Hebrew            he        heb
Italian           it        ita
Japanese          ja        jpn
Korean            ko        kor
Norwegian         no        nor
Russian           ru        rus
Spanish           es        esl/spa
Swedish           sv        sve/swe
Turkish           tr        tur

Second Subtag: Namespace

The second subtag usually is a standardized country token, chosen from the ISO 3166 set of country code and region standards. But it may also be another string, which you may register with the IANA. Here are the rules:

If the second subtag has:

·         Two characters, it's a country/region defined by ISO 3166 (the country codes AA, QM-QZ, XA-XZ, and ZZ are reserved by ISO 3166 as user-assigned codes and must not be used to form language tags)

·         Three to eight characters, it may be registered with the IANA

·         One character, it is illegal
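These second-subtag rules, including the reserved user-assigned country codes, can be sketched the same way. The names is_user_assigned and classify_second_subtag and the returned labels are hypothetical. The sketch checks only the subtag's length and the reserved ranges, not the full ISO 3166 list or the IANA registry.

```python
def is_user_assigned(code: str) -> bool:
    """True for the ISO 3166 user-assigned codes AA, QM-QZ, XA-XZ, and ZZ."""
    code = code.upper()
    if len(code) != 2:
        return False
    if code in ("AA", "ZZ"):
        return True
    first, second = code[0], code[1]
    return (first == "Q" and "M" <= second <= "Z") or (
        first == "X" and "A" <= second <= "Z"
    )


def classify_second_subtag(tag: str) -> str:
    """Describe the second subtag of `tag` according to its length."""
    subtags = tag.split("-")
    if len(subtags) < 2:
        return "absent"
    second = subtags[1]
    if len(second) == 1:
        return "illegal"
    if len(second) == 2:
        if is_user_assigned(second):
            return "reserved user-assigned code"
        return "ISO 3166 country code"
    if len(second) <= 8:
        return "may be registered with the IANA"
    return "illegal"


print(classify_second_subtag("pt-BR"))     # ISO 3166 country code
print(classify_second_subtag("zh-xiang"))  # may be registered with the IANA
```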

Some of the ISO 3166 country codes are shown in Table 16-6. The complete list of country codes can be found in Appendix G.

Table 16-6. Sample ISO 3166 country codes

Country                          Code
Brazil                           BR
Canada                           CA
China                            CN
France                           FR
Germany                          DE
Holy See (Vatican City State)    VA
Hong Kong                        HK
India                            IN
Italy                            IT
Japan                            JP
Lebanon                          LB
Mexico                           MX
Pakistan                         PK
Russian Federation               RU
United Kingdom                   GB
United States                    US

Remaining Subtags: Namespace

There are no rules for the third and following subtags, except that each may be up to eight characters long (letters and digits).

Configuring Language Preferences

You can configure language preferences in your browser profile.

Netscape Navigator lets you set language preferences through Edit > Preferences > Languages, and Microsoft Internet Explorer lets you set languages through Tools > Internet Options > Languages.

Language Tag Reference Tables

Appendix G contains convenient reference tables for language tags:

·         IANA-registered language tags are shown in Table G-1.

·         ISO 639 language codes are shown in Table G-2.

·         ISO 3166 country codes are shown in Table G-3.

 

