Unicode

Java uses the Unicode character encoding. Java 1.0 used Unicode version 1.1, while Java 1.1 has adopted the newer Unicode 2.0 standard. Unicode is a 16-bit character encoding established by the Unicode Consortium, which describes the standard as follows (see http://unicode.org):

The Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages.

In its current version (2.0), the Unicode standard contains 38,885 distinct coded characters derived from 25 supported scripts. These characters cover the principal written languages of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica.

In the canonical form of the Unicode encoding, which is what Java char and String types use, every character occupies two bytes. The Unicode characters \u0020 to \u007E are equivalent to the ASCII and ISO8859-1 (Latin-1) characters 0x20 through 0x7E. The Unicode characters \u00A0 to \u00FF are identical to the ISO8859-1 characters 0xA0 to 0xFF. Thus there is a trivial mapping between Latin-1 and Unicode characters. A number of other portions of the Unicode encoding are based on pre-existing standards, such as ISO8859-5 (Cyrillic) and ISO8859-8 (Hebrew), though the mappings between these standards and Unicode may not be as trivial as the Latin-1 mapping.
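The trivial Latin-1 mapping can be sketched in a few lines. The class and method names below are hypothetical, and the comparison against `StandardCharsets.ISO_8859_1` assumes a modern JDK (Java 7 or later) rather than the Java 1.1 environment described here.

```java
import java.nio.charset.StandardCharsets;

class Latin1Mapping {
    // Widen one Latin-1 byte to the Unicode char with the same numeric value.
    static char latin1ToChar(byte b) {
        return (char) (b & 0xFF);   // mask so bytes 0x80-0xFF are not sign-extended
    }

    // Decode a whole Latin-1 byte array by widening each byte in turn.
    static String decodeLatin1(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append(latin1ToChar(b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The letters c, a, f followed by e-acute (0xE9 in Latin-1)
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        String s = decodeLatin1(latin1);
        System.out.println(s.equals("caf\u00E9"));                                     // true
        System.out.println(s.equals(new String(latin1, StandardCharsets.ISO_8859_1))); // true
    }
}
```

The mask is the only subtlety: Java bytes are signed, so without it the byte 0xE9 would widen to \uFFE9 instead of \u00E9.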

Note that Unicode support is quite limited on many platforms. One of the difficulties with the use of Unicode is the poor availability of fonts to display all of the Unicode characters. Figure 11.1 shows the characters that are available on a typical configuration of the U.S. English Windows platform. Note the special box glyph used to indicate undefined characters.

Figure 11.1: Some Unicode characters and their encodings

[Graphic: Figure 11-1]

Unicode is similar to, but not the same as, ISO 10646, the UCS (Universal Character Set) encoding. UCS is a 2- or 4-byte encoding originally intended to contain all national standard character encodings. For example, it was to include the separate Chinese, Japanese, Korean, and Vietnamese encodings for Han ideographic characters. Unicode, in contrast, "unifies" these disparate encodings into a single set of Han characters that work for all four countries. Unicode has been so successful, however, that ISO 10646 has adopted it in place of non-unified encodings. Thus, ISO 10646 is effectively Unicode, with the option of two extra bytes for expansion purposes.

Unicode is a trademark of the Unicode Consortium. Version 2.0 of the standard is defined by the book The Unicode Standard, Version 2.0 (published by Addison-Wesley). Further information about the Unicode standard and the Unicode Consortium can be obtained at http://unicode.org/.

Table 11.1 provides an overview of the Unicode 2.0 encoding.

Table 11.1: Outline of the Unicode 2.0 Encoding

Start  End   Description
0000   1FFF  Alphabets
0000   007F  Basic Latin
0080   00FF  Latin-1 Supplement
0100   017F  Latin Extended-A
0180   024F  Latin Extended-B
0250   02AF  IPA Extensions
02B0   02FF  Spacing Modifier Letters
0300   036F  Combining Diacritical Marks
0370   03FF  Greek
0400   04FF  Cyrillic
0530   058F  Armenian
0590   05FF  Hebrew
0600   06FF  Arabic
0900   097F  Devanagari
0980   09FF  Bengali
0A00   0A7F  Gurmukhi
0A80   0AFF  Gujarati
0B00   0B7F  Oriya
0B80   0BFF  Tamil
0C00   0C7F  Telugu
0C80   0CFF  Kannada
0D00   0D7F  Malayalam
0E00   0E7F  Thai
0E80   0EFF  Lao
0F00   0FBF  Tibetan
10A0   10FF  Georgian
1100   11FF  Hangul Jamo
1E00   1EFF  Latin Extended Additional
1F00   1FFF  Greek Extended
2000   2FFF  Symbols and Punctuation
2000   206F  General Punctuation
2070   209F  Superscripts and Subscripts
20A0   20CF  Currency Symbols
20D0   20FF  Combining Marks for Symbols
2100   214F  Letterlike Symbols
2150   218F  Number Forms
2190   21FF  Arrows
2200   22FF  Mathematical Operators
2300   23FF  Miscellaneous Technical
2400   243F  Control Pictures
2440   245F  Optical Character Recognition
2460   24FF  Enclosed Alphanumerics
2500   257F  Box Drawing
2580   259F  Block Elements
25A0   25FF  Geometric Shapes
2600   26FF  Miscellaneous Symbols
2700   27BF  Dingbats
3000   33FF  CJK Auxiliary
3000   303F  CJK Symbols and Punctuation
3040   309F  Hiragana
30A0   30FF  Katakana
3100   312F  Bopomofo
3130   318F  Hangul Compatibility Jamo
3190   319F  Kanbun
3200   32FF  Enclosed CJK Letters and Months
3300   33FF  CJK Compatibility
4E00   9FFF  CJK Unified Ideographs (Han characters used in China, Japan, Korea, Taiwan, and Vietnam)
AC00   D7A3  Hangul Syllables
D800   DFFF  Surrogates
D800   DB7F  High Surrogates
DB80   DBFF  High Private Use Surrogates
DC00   DFFF  Low Surrogates
E000   F8FF  Private Use
F900   FFFF  Miscellaneous
F900   FAFF  CJK Compatibility Ideographs
FB00   FB4F  Alphabetic Presentation Forms
FB50   FDFF  Arabic Presentation Forms-A
FE20   FE2F  Combining Half Marks
FE30   FE4F  CJK Compatibility Forms
FE50   FE6F  Small Form Variants
FE70   FEFE  Arabic Presentation Forms-B
FEFF   FEFF  Specials
FF00   FFEF  Halfwidth and Fullwidth Forms
FFF0   FFFF  Specials

Unicode and Local Encodings

While Java programs use Unicode text internally, Unicode is not the customary character encoding for most countries or locales. Thus, an important requirement for Java programs is to be able to convert text from the local encoding to Unicode as it is read (from a file or network, for example) and to be able to convert text from Unicode to the local encoding as it is written. In Java 1.0, this requirement is not well supported. In Java 1.1, however, the conversion can be done with the java.io.InputStreamReader and java.io.OutputStreamWriter classes, respectively. These classes load an appropriate ByteToCharConverter or CharToByteConverter class to perform the conversion. Note that these converter classes are part of the sun.io package and are not for public use (although an explicit conversion interface may be defined in a later release of Java).
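As a rough illustration of this conversion, the sketch below decodes bytes with an InputStreamReader and encodes text with an OutputStreamWriter. The class and helper names are hypothetical, in-memory streams stand in for a file or network connection, and the encoding names "ISO-8859-1" and "UTF-8" are the canonical names accepted by current JDKs (Java 1.1 documented different spellings). Wrapping the checked exception in `UncheckedIOException` assumes Java 8 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class LocalEncodingDemo {
    // Convert bytes in the named encoding to Unicode text as they are read.
    static String decode(byte[] bytes, String encoding) {
        try {
            Reader in = new InputStreamReader(new ByteArrayInputStream(bytes), encoding);
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                sb.append((char) c);
            }
            in.close();
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convert Unicode text to bytes in the named encoding as it is written.
    static byte[] encode(String text, String encoding) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            Writer out = new OutputStreamWriter(bytes, encoding);
            out.write(text);
            out.close();   // closing flushes any buffered characters
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 };                  // e-acute in ISO 8859-1
        String text = decode(latin1, "ISO-8859-1");
        System.out.println(text.equals("\u00E9"));        // true
        System.out.println(encode(text, "UTF-8").length); // 2
    }
}
```

Naming the encoding explicitly, rather than relying on the platform default, keeps the result the same on every machine.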

The UTF-8 Encoding

The canonical two-bytes-per-character encoding is useful for the manipulation of character data and is the internal representation used throughout Java. However, because a large amount of text used by Java programs is 8-bit text, and because there are so many existing computer systems that support only 8-bit characters, the 16-bit canonical form is usually neither the most efficient way to store Unicode text nor the most portable way to transmit it.

Because of this, other encodings called "transformation formats" have been developed. Java provides simple support for the UTF-8 encoding with the DataInputStream.readUTF() and DataOutputStream.writeUTF() methods. UTF-8 is a variable-width or "multi-byte" encoding format; this means that different characters require different numbers of bytes. In UTF-8, the standard ASCII characters occupy only one byte, and remain untouched by the encoding (i.e., a string of ASCII characters is a legal UTF-8 string). As a tradeoff, however, other Unicode characters occupy two or three bytes.

In UTF-8, Unicode characters between \u0000 and \u007F occupy a single byte, which has a value of between 0x00 and 0x7F, and which always has its high-order bit set to 0. Characters between \u0080 and \u07FF occupy two bytes, and characters between \u0800 and \uFFFF occupy three bytes. The first byte of a two-byte character always has high-order bits 110, and the first byte of a three-byte character always has high-order bits 1110. Since single-byte characters always have 0 as their high-order bit, the one-, two-, and three-byte characters can easily be distinguished from each other.
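The high-order bit patterns just described are enough to tell how long each character is. The following is a minimal sketch with hypothetical class and method names. It covers only the one-, two-, and three-byte forms discussed here.

```java
class Utf8LeadByte {
    // Returns 1, 2, or 3 for a byte that can start a character,
    // or -1 for any other bit pattern (for example, a 10xxxxxx byte).
    static int sequenceLength(byte first) {
        int b = first & 0xFF;
        if ((b & 0x80) == 0x00) return 1;   // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;   // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;   // 1110xxxx
        return -1;
    }

    // Counts characters by hopping from one first byte to the next.
    static int countCharacters(byte[] utf8) {
        int count = 0;
        int i = 0;
        while (i < utf8.length) {
            int len = sequenceLength(utf8[i]);
            if (len < 0) {
                throw new IllegalArgumentException("not a first byte at index " + i);
            }
            i += len;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A' (1 byte), \u00E9 (2 bytes), \u3042 (3 bytes)
        byte[] data = { 0x41, (byte) 0xC3, (byte) 0xA9,
                        (byte) 0xE3, (byte) 0x81, (byte) 0x82 };
        System.out.println(countCharacters(data));   // 3
    }
}
```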

The second and third bytes of two- and three-byte characters always have high-order bits 10, which distinguishes them from one-byte characters, and also distinguishes them from the first byte of a two- or three-byte sequence. This is important because it allows a program to locate the start of a character in a multi-byte sequence.
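One way to use this property is to back up from an arbitrary position until a byte that does not begin with bits 10 is found. This is a sketch only: the class and method names are hypothetical, and the input is assumed to be well-formed UTF-8.

```java
class Utf8Boundaries {
    // True for the second and third bytes of a multi-byte character (10xxxxxx).
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Returns the index of the first byte of the character that contains pos.
    static int startOfCharacter(byte[] utf8, int pos) {
        while (pos > 0 && isContinuation(utf8[pos])) {
            pos--;
        }
        return pos;
    }

    public static void main(String[] args) {
        // 'A', then \u3042 as the three bytes E3 81 82, then 'B'
        byte[] data = { 0x41, (byte) 0xE3, (byte) 0x81, (byte) 0x82, 0x42 };
        System.out.println(startOfCharacter(data, 3));   // 1
        System.out.println(startOfCharacter(data, 4));   // 4
    }
}
```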

The remaining bits in each character (i.e., the bits that are not part of one of the required high-order bit sequences) are used to encode the actual Unicode character data. In the single-byte form, there are seven bits available, suitable for encoding characters up to \u007F. In the two-byte form, there are 11 data bits available, which is enough to encode values to \u07FF, and in the three-byte form there are 16 available data bits, which is enough to encode all 16-bit Unicode characters. Table 11.2 summarizes the UTF-8 encoding.

Table 11.2: The UTF-8 Encoding

Start Character  End Character  Required Data Bits  Binary Byte Sequence (x = data bits)
\u0000           \u007F         7                   0xxxxxxx
\u0080           \u07FF         11                  110xxxxx 10xxxxxx
\u0800           \uFFFF         16                  1110xxxx 10xxxxxx 10xxxxxx
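The three rows of Table 11.2 translate directly into shifts and masks. The encoder below is an illustrative sketch with hypothetical names, not the code Java itself uses. It encodes one 16-bit char at a time, as described in this section, and makes no attempt to combine surrogate pairs.

```java
class Utf8Encoder {
    // Encode a single 16-bit char using the one-, two-, or three-byte form.
    static byte[] encode(char c) {
        if (c <= 0x007F) {
            // 0xxxxxxx: 7 data bits
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // 110xxxxx 10xxxxxx: 5 + 6 = 11 data bits
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 data bits
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Format bytes as space-separated two-digit hexadecimal values.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toHex(encode('A')));        // 41
        System.out.println(toHex(encode('\u00E9')));   // C3 A9
        System.out.println(toHex(encode('\u3042')));   // E3 81 82
    }
}
```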

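The UTF-8 encoding has several desirable features that follow from the rules above. ASCII characters are one-byte UTF-8 characters, so a legal ASCII string is a legal UTF-8 string. Any byte with its high-order bit set is part of a multi-byte character. The first byte of a character indicates how many bytes the character occupies, and it is easily distinguished from the bytes that follow, so the start of a character can be located from an arbitrary position in a stream of bytes.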

Java actually uses a slightly modified form of UTF-8. The Unicode character \u0000 is encoded using a two-byte sequence, so that an encoded Unicode string never contains null characters.
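The modified encoding can be observed by writing a string that contains \u0000 with DataOutputStream.writeUTF() and inspecting the bytes. In this sketch the class and helper names are hypothetical and `UncheckedIOException` assumes Java 8 or later. Note that writeUTF() writes a two-byte length (a count of the encoded bytes that follow) before the character data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ModifiedUtf8Demo {
    // Returns exactly the bytes that writeUTF() produces for the string.
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads back a string from bytes produced by writeUTF().
    static String readUtf(byte[] bytes) {
        try {
            return new DataInputStream(new ByteArrayInputStream(bytes)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = writeUtf("A\u0000");
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // Length prefix 00 03, then 'A' as 41, then \u0000 as C0 80
        System.out.println(hex.toString().trim());              // 00 03 41 C0 80
        System.out.println(readUtf(bytes).equals("A\u0000"));   // true
    }
}
```

No byte of the encoded character data is zero, which is the point of the modification.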