HTML 4.01 Language Features

Coordinating character sets is only the first part of the challenge. Even languages that share a character set may have different rules for hyphenation, spacing, quotation marks, punctuation, and so on. Beyond character shapes (glyphs), issues such as directionality (whether the text reads left-to-right or right-to-left) and cursive joining behavior also have to be taken into account.

This created the need for a system of language identification. The W3C responded by incorporating into HTML the language tags put forth in RFC 2070, the standard on the internationalization of HTML.

The lang Attribute

The lang attribute can be added within most tags to specify the language of that element's content. Added within the <html> tag, it specifies the language for the entire document. The following example specifies the document's language as French:

<HTML LANG="fr">

It can also be used within text elements to switch to other languages within a document; for example, you can "turn on" Norwegian for just one element:

<BLOCKQUOTE lang="no">...</BLOCKQUOTE>
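
To see both uses together, here is a minimal sketch of a complete document, assuming a UTF-8 encoded file. The title and sample sentences are invented for illustration. The French default set on <html> is inherited everywhere except inside the blockquote, which switches to Norwegian:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
<HTML LANG="fr">
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<TITLE>Exemple</TITLE>
</HEAD>
<BODY>
<!-- Inherits LANG="fr" from the HTML element -->
<P>Ce paragraphe est en français.</P>
<!-- Overrides the document language for this element only -->
<BLOCKQUOTE LANG="no">
<P>Dette avsnittet er på norsk.</P>
</BLOCKQUOTE>
<!-- Back to the inherited French -->
<P>Ce paragraphe est de nouveau en français.</P>
</BODY>
</HTML>
```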

The value of the lang attribute is a language code (not the same as a country code). The current HTML and XML specifications support the two-letter language codes established in RFC 1766, which are listed in Table 7-1. Language identification has since advanced to include three-letter codes, two-letter codes with a country subcode (for example, fr-CA for French as used in Canada), and other descriptive subcodes, as proposed in RFC 3066. This revised system is expected to be supported in future updates of the HTML and XML specifications.
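
For illustration, a code with a country subcode is written in markup the same way as a plain two-letter code. The fragment below is only a sketch: the sample phrases are invented, and whether a given browser treats fr-CA any differently from fr is an assumption to test rather than rely on:

```html
<!-- Primary language code only: French -->
<P LANG="fr">Bon week-end.</P>
<!-- Language code plus country subcode: French as used in Canada -->
<P LANG="fr-CA">Bonne fin de semaine.</P>
```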

Table 7-1. Two-letter codes of language names

Code  Language          Code  Language                  Code  Language
aa    Afar              fy    Frisian                   lv    Latvian
ab    Abkhazian         ga    Irish                     mg    Malagasy
af    Afrikaans         gd    Scots Gaelic              mi    Maori
am    Amharic           gl    Galician                  mk    Macedonian
ar    Arabic            gn    Guarani                   ml    Malayalam
as    Assamese          gu    Gujarati                  mn    Mongolian
ay    Aymara            ha    Hausa                     mo    Moldavian
az    Azerbaijani       he    Hebrew (formerly iw)      mr    Marathi
ba    Bashkir           hi    Hindi                     ms    Malay
be    Byelorussian      hr    Croatian                  mt    Maltese
bg    Bulgarian         hu    Hungarian                 my    Burmese
bh    Bihari            hy    Armenian                  na    Nauru
bi    Bislama           ia    Interlingua               ne    Nepali
bn    Bengali; Bangla   id    Indonesian (formerly in)  nl    Dutch
bo    Tibetan           ie    Interlingue               no    Norwegian
br    Breton            ik    Inupiak                   oc    Occitan
ca    Catalan           is    Icelandic                 om    (Afan) Oromo
co    Corsican          it    Italian                   or    Oriya
cs    Czech             iu    Inuktitut                 pa    Punjabi
cy    Welsh             ja    Japanese                  pl    Polish
da    Danish            jw    Javanese                  ps    Pashto, Pushto
de    German            ka    Georgian                  pt    Portuguese
dz    Bhutani           kk    Kazakh                    qu    Quechua
el    Greek             kl    Greenlandic               rm    Rhaeto-Romance
en    English           km    Cambodian                 rn    Kirundi
eo    Esperanto         kn    Kannada                   ro    Romanian
es    Spanish           ko    Korean                    ru    Russian
et    Estonian          ks    Kashmiri                  rw    Kinyarwanda
eu    Basque            ku    Kurdish                   sa    Sanskrit
fa    Persian           ky    Kirghiz                   sd    Sindhi
fi    Finnish           la    Latin                     sg    Sangho
fj    Fiji              ln    Lingala                   sh    Serbo-Croatian
fo    Faroese           lo    Laothian                  si    Sinhalese
fr    French            lt    Lithuanian                sk    Slovak
sl    Slovenian         tg    Tajik                     uk    Ukrainian
sm    Samoan            th    Thai                      ur    Urdu
sn    Shona             ti    Tigrinya                  uz    Uzbek
so    Somali            tk    Turkmen                   vi    Vietnamese
sq    Albanian          tl    Tagalog                   vo    Volapuk
sr    Serbian           tn    Setswana                  wo    Wolof
ss    Siswati           to    Tonga                     xh    Xhosa
st    Sesotho           tr    Turkish                   yi    Yiddish (formerly ji)
su    Sundanese         ts    Tsonga                    yo    Yoruba
sv    Swedish           tt    Tatar                     za    Zhuang
sw    Swahili           tw    Twi                       zh    Chinese
ta    Tamil             ug    Uighur                    zu    Zulu
te    Telugu

Directionality

An internationalized HTML standard needs to take into account that many languages read from right to left. In Unicode, each character carries its own directionality as one of its properties.

The HTML 4.01 specification provides the new dir attribute for specifying the direction in which text should be interpreted. It can be used in conjunction with the lang attribute and may be added within the tags of most elements. The accepted values are ltr for left-to-right and rtl for right-to-left. For example, the following code indicates that the paragraph is in Arabic and reads from right to left:

<P LANG="ar" DIR="rtl">...</P>
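
As a sketch of how dir combines with lang and nests, the fragment below sets a right-to-left paragraph that contains a left-to-right English phrase. The Hebrew word is written with numeric character references (it spells shalom) so that the example does not depend on the file's encoding. The wording is invented for illustration:

```html
<!-- Base direction for the whole paragraph is right-to-left -->
<P LANG="he" DIR="rtl">&#1513;&#1500;&#1493;&#1501;
  <!-- The nested span switches language and direction for the English phrase only -->
  <SPAN LANG="en" DIR="ltr">Hello, world!</SPAN>
</P>
```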

HTML 4.01 also introduces a tag that deals specifically with documents that combine left- and right-reading text (bidirectional text, or Bidi for short). The <bdo> tag stands for "bidirectional override": it marks a span of text whose intrinsic direction (as inherited from Unicode) should be overridden. The <bdo> tag takes the dir attribute as follows:

<BDO DIR="ltr">English phrase in an otherwise Hebrew text</BDO>...

The <bdo> element and dir attribute are currently not supported by browsers.
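
To make the override concrete, here is a small sketch using plain Latin letters, whose intrinsic direction is left-to-right. In a user agent that implements <bdo>, the second paragraph should display the letters in reverse order (CBA). The sample text is arbitrary:

```html
<!-- Intrinsic direction applies: displays as ABC -->
<P>ABC</P>
<!-- Override forces right-to-left ordering of the same characters -->
<P><BDO DIR="rtl">ABC</BDO></P>
```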

Cursive Joining Behavior

In some writing systems, the shape of a character varies depending on its position in the word. For instance, in Arabic, a character used at the beginning of a word looks completely different when it is used as the last character of a word. Generally, this joining behavior is handled within the software, but there are Unicode characters that give precise control over joining behavior. They have zero width and are placed between characters purely to specify whether the neighboring characters should join.

HTML 4.01 provides mnemonic character entities for both of these characters, as shown in Table 7-2.

Table 7-2. Unicode characters for joining behavior

Mnemonic  Numeric  Name                   Description
&zwnj;    &#8204;  zero-width non-joiner  Prevents joining of characters that would otherwise be joined
&zwj;     &#8205;  zero-width joiner      Joins characters that would otherwise not be joined
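
As a sketch of how the two entities appear in markup, the fragment below places each one next to letters written as numeric character references. The sample strings are assumptions chosen for illustration: the first is a Persian word commonly spelled with a non-joiner, and the second is a lone Arabic letter heh followed by a joiner so that it takes its joining shape:

```html
<!-- &zwnj; keeps the yeh (&#1740;) from joining to the kheh (&#1582;) that follows -->
<P LANG="fa" DIR="rtl">&#1605;&#1740;&zwnj;&#1582;&#1608;&#1575;&#1607;&#1605;</P>
<!-- &zwj; makes the isolated heh (&#1607;) render as if another letter followed -->
<P LANG="ar" DIR="rtl">&#1607;&zwj;</P>
```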