Unicode

Java uses the Unicode character encoding. Java 1.0 used Unicode version 1.1, while Java 1.1 has adopted the newer Unicode 2.0 standard. Unicode is a 16-bit character encoding established by the Unicode Consortium, which describes the standard as follows (see http://unicode.org):

The Unicode Worldwide Character Standard is a character coding system designed to support the interchange, processing, and display of the written texts of the diverse languages of the modern world. In addition, it supports classical and historical texts of many written languages.
In its current version (2.0), the Unicode standard contains 38,885 distinct coded characters derived from 25 supported scripts. These characters cover the principal written languages of the Americas, Europe, the Middle East, Africa, India, Asia, and Pacifica.

In the canonical form of the Unicode encoding, which is what Java char and String types use, every character occupies two bytes. The Unicode characters \u0020 to \u007E are equivalent to the ASCII and ISO8859-1 (Latin-1) characters 0x20 through 0x7E. The Unicode characters \u00A0 to \u00FF are identical to the ISO8859-1 characters 0xA0 to 0xFF. Thus there is a trivial mapping between Latin-1 and Unicode characters. A number of other portions of the Unicode encoding are based on pre-existing standards, such as ISO8859-5 (Cyrillic) and ISO8859-8 (Hebrew), though the mappings between these standards and Unicode may not be as trivial as the Latin-1 mapping.
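The trivial Latin-1 mapping described above can be illustrated with a short sketch. The class and method names here (Latin1ToUnicode, fromLatin1) are hypothetical and chosen only for this example; the one real subtlety is that Java bytes are signed, so each byte is masked before it is widened to a char.

```java
public class Latin1ToUnicode {
    // Widen each Latin-1 byte to the Unicode char with the same numeric value.
    // The & 0xFF mask undoes the sign extension Java applies to byte values.
    static String fromLatin1(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) (bytes[i] & 0xFF);
        }
        return new String(chars);
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute (0xE9) as its last letter, in ISO8859-1
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        String s = fromLatin1(latin1);
        System.out.println((int) s.charAt(3)); // prints 233, the value of U+00E9
    }
}
```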

Note that Unicode support is quite limited on many platforms. One of the difficulties with the use of Unicode is the poor availability of fonts to display all of the Unicode characters. Figure 11.1 shows the characters that are available on a typical configuration of the U.S. English Windows platform. Note the special box glyph used to indicate undefined characters.

Figure 11.1: Some Unicode characters and their encodings

[Graphic: Figure 11-1]

Unicode is similar to, but not the same as, ISO 10646, the UCS (Universal Character Set) encoding. UCS is a 2- or 4-byte encoding originally intended to contain all national standard character encodings. For example, it was to include the separate Chinese, Japanese, Korean, and Vietnamese encodings for Han ideographic characters. Unicode, in contrast, "unifies" these disparate encodings into a single set of Han characters that work for all four countries. Unicode has been so successful, however, that ISO 10646 has adopted it in place of non-unified encodings. Thus, ISO 10646 is effectively Unicode, with the option of two extra bytes for expansion purposes.

Unicode is a trademark of the Unicode Consortium. Version 2.0 of the standard is defined by the book The Unicode Standard, Version 2.0 (published by Addison-Wesley). Further information about the Unicode standard and the Unicode Consortium can be obtained at http://unicode.org/.

Table 11.1 provides an overview of the Unicode 2.0 encoding.

Table 11.1: Outline of the Unicode 2.0 Encoding
Start	End	Description
0000	1FFF	Alphabets
0000	007F	Basic Latin
0080	00FF	Latin-1 Supplement
0100	017F	Latin Extended-A
0180	024F	Latin Extended-B
0250	02AF	IPA Extensions
02B0	02FF	Spacing Modifier Letters
0300	036F	Combining Diacritical Marks
0370	03FF	Greek
0400	04FF	Cyrillic
0530	058F	Armenian
0590	05FF	Hebrew
0600	06FF	Arabic
0900	097F	Devanagari
0980	09FF	Bengali
0A00	0A7F	Gurmukhi
0A80	0AFF	Gujarati
0B00	0B7F	Oriya
0B80	0BFF	Tamil
0C00	0C7F	Telugu
0C80	0CFF	Kannada
0D00	0D7F	Malayalam
0E00	0E7F	Thai
0E80	0EFF	Lao
0F00	0FBF	Tibetan
10A0	10FF	Georgian
1100	11FF	Hangul Jamo
1E00	1EFF	Latin Extended Additional
1F00	1FFF	Greek Extended
2000	2FFF	Symbols and Punctuation
2000	206F	General Punctuation
2070	209F	Superscripts and Subscripts
20A0	20CF	Currency Symbols
20D0	20FF	Combining Marks for Symbols
2100	214F	Letterlike Symbols
2150	218F	Number Forms
2190	21FF	Arrows
2200	22FF	Mathematical Operators
2300	23FF	Miscellaneous Technical
2400	243F	Control Pictures
2440	245F	Optical Character Recognition
2460	24FF	Enclosed Alphanumerics
2500	257F	Box Drawing
2580	259F	Block Elements
25A0	25FF	Geometric Shapes
2600	26FF	Miscellaneous Symbols
2700	27BF	Dingbats
3000	33FF	CJK Auxiliary
3000	303F	CJK Symbols and Punctuation
3040	309F	Hiragana
30A0	30FF	Katakana
3100	312F	Bopomofo
3130	318F	Hangul Compatibility Jamo
3190	319F	Kanbun
3200	32FF	Enclosed CJK Letters and Months
3300	33FF	CJK Compatibility
4E00	9FFF	CJK Unified Ideographs (Han characters used in China, Japan, Korea, Taiwan, and Vietnam)
AC00	D7A3	Hangul Syllables
D800	DFFF	Surrogates
D800	DB7F	High Surrogates
DB80	DBFF	High Private Use Surrogates
DC00	DFFF	Low Surrogates
E000	F8FF	Private Use
F900	FFFF	Miscellaneous
F900	FAFF	CJK Compatibility Ideographs
FB00	FB4F	Alphabetic Presentation Forms
FB50	FDFF	Arabic Presentation Forms-A
FE20	FE2F	Combining Half Marks
FE30	FE4F	CJK Compatibility Forms
FE50	FE6F	Small Form Variants
FE70	FEFE	Arabic Presentation Forms-B
FEFF	FEFF	Specials
FF00	FFEF	Halfwidth and Fullwidth Forms
FFF0	FFFF	Specials

Unicode and Local Encodings

While Java programs use Unicode text internally, Unicode is not the customary character encoding for most countries or locales. Thus, an important requirement for Java programs is to be able to convert text from the local encoding to Unicode as it is read (from a file or network, for example) and to be able to convert text from Unicode to the local encoding as it is written. In Java 1.0, this requirement is not well supported. In Java 1.1, however, the conversion can be done with the java.io.InputStreamReader and java.io.OutputStreamWriter classes, respectively. These classes load an appropriate ByteToCharConverter or CharToByteConverter class to perform the conversion. Note that these converter classes are part of the sun.io package and are not for public use (although an explicit conversion interface may be defined in a later release of Java).
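As a rough illustration of this conversion, the following sketch decodes Latin-1 bytes into a Java string with InputStreamReader and re-encodes the string as UTF-8 with OutputStreamWriter. The class and helper names (LocalEncodingDemo, decode, encode) are hypothetical, in-memory streams stand in for a file or network connection, and the encoding names "ISO-8859-1" and "UTF-8" are assumptions about what the running Java release accepts; older releases may expect different names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

public class LocalEncodingDemo {
    // Convert bytes in the named encoding to Unicode characters as they are read.
    static String decode(byte[] bytes, String encoding) throws IOException {
        Reader in = new InputStreamReader(new ByteArrayInputStream(bytes), encoding);
        StringBuilder text = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            text.append((char) c);
        }
        in.close();
        return text.toString();
    }

    // Convert Unicode characters to bytes in the named encoding as they are written.
    static byte[] encode(String text, String encoding) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Writer out = new OutputStreamWriter(bytes, encoding);
        out.write(text);
        out.close(); // closing flushes any buffered bytes
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // "cafe" with an e-acute (0xE9) as its last letter, in ISO8859-1
        byte[] latin1 = { 0x63, 0x61, 0x66, (byte) 0xE9 };
        String text = decode(latin1, "ISO-8859-1");
        byte[] utf8 = encode(text, "UTF-8");
        // prints "4 chars, 5 UTF-8 bytes"
        System.out.println(text.length() + " chars, " + utf8.length + " UTF-8 bytes");
    }
}
```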

The UTF-8 Encoding

The canonical two-bytes-per-character encoding is useful for the manipulation of character data and is the internal representation used throughout Java. However, because a large amount of text used by Java programs is 8-bit text, and because there are so many existing computer systems that support only 8-bit characters, the 16-bit canonical form is usually neither the most efficient way to store Unicode text nor the most portable way to transmit it.

Because of this, other encodings called "transformation formats" have been developed. Java provides simple support for the UTF-8 encoding with the DataInputStream.readUTF() and DataOutputStream.writeUTF() methods. UTF-8 is a variable-width or "multi-byte" encoding format; this means that different characters require different numbers of bytes. In UTF-8, the standard ASCII characters occupy only one byte, and remain untouched by the encoding (i.e., a string of ASCII characters is a legal UTF-8 string). As a tradeoff, however, other Unicode characters occupy two or three bytes.
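A minimal round trip through writeUTF() and readUTF() might look like the following sketch. The class and helper names (WriteUtfDemo, write, read) are hypothetical, and a byte array stands in for a real file or socket. Note that writeUTF() writes a two-byte length before the encoded characters, which accounts for two of the bytes counted below.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteUtfDemo {
    // Encode a string with DataOutputStream.writeUTF() into a byte array.
    static byte[] write(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(s);
        out.flush();
        return buffer.toByteArray();
    }

    // Decode a string previously written by writeUTF().
    static String read(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return in.readUTF();
    }

    public static void main(String[] args) throws IOException {
        // One ASCII letter, one Latin-1 letter (U+00E9), one symbol (U+20AC)
        String s = "A\u00E9\u20AC";
        byte[] data = write(s);
        // 2-byte length prefix + 1 + 2 + 3 encoded bytes: prints 8
        System.out.println(data.length);
        System.out.println(read(data).equals(s)); // prints true
    }
}
```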

In UTF-8, Unicode characters between \u0000 and \u007F occupy a single byte, which has a value between 0x00 and 0x7F and therefore always has its high-order bit set to 0. Characters between \u0080 and \u07FF occupy two bytes, and characters between \u0800 and \uFFFF occupy three bytes. The first byte of a two-byte character always has high-order bits 110, and the first byte of a three-byte character always has high-order bits 1110. Since single-byte characters always have 0 as their high-order bit, the one-, two-, and three-byte characters can easily be distinguished from each other.

The second and third bytes of two- and three-byte characters always have high-order bits 10, which distinguishes them from one-byte characters, and also distinguishes them from the first byte of a two- or three-byte sequence. This is important because it allows a program to locate the start of a character in a multi-byte sequence.
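One way to use this property is sketched below: starting from an arbitrary index in a buffer of UTF-8 bytes, step backward over continuation bytes (those of the form 10xxxxxx) until the first byte of the character is reached. The names Utf8Resync, isContinuation, and startOfChar are hypothetical, and the sketch assumes the buffer holds well-formed UTF-8.

```java
public class Utf8Resync {
    // A continuation byte has 10 as its two high-order bits.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Back up from index to the first byte of the character that contains it.
    static int startOfChar(byte[] utf8, int index) {
        while (index > 0 && isContinuation(utf8[index])) {
            index--;
        }
        return index;
    }

    public static void main(String[] args) {
        // 'A', then the three-byte encoding of U+20AC, then 'B'
        byte[] data = { 0x41, (byte) 0xE2, (byte) 0x82, (byte) 0xAC, 0x42 };
        System.out.println(startOfChar(data, 3)); // prints 1
        System.out.println(startOfChar(data, 4)); // prints 4
    }
}
```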

The remaining bits in each character (i.e., the bits that are not part of one of the required high-order bit sequences) are used to encode the actual Unicode character data. In the single-byte form, there are seven bits available, suitable for encoding characters up to \u007F. In the two-byte form, there are 11 data bits available, which is enough to encode values to \u07FF, and in the three-byte form there are 16 available data bits, which is enough to encode all 16-bit Unicode characters. Table 11.2 summarizes the UTF-8 encoding.

Table 11.2: The UTF-8 Encoding
Start Character	End Character	Required Data Bits	Binary Byte Sequence (`x` = data bits)
\u0000	\u007F	7	0xxxxxxx
\u0080	\u07FF	11	110xxxxx 10xxxxxx
\u0800	\uFFFF	16	1110xxxx 10xxxxxx 10xxxxxx
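The bit layout in the table can be turned into code fairly directly. The sketch below encodes a single 16-bit char into one, two, or three bytes and decodes it again. The class and method names (Utf8Encoder, encode, decode) are hypothetical. It follows the table as written, so it treats every 16-bit value as an ordinary character, and its decoder assumes well-formed input.

```java
public class Utf8Encoder {
    // Encode one 16-bit character using the one-, two-, or three-byte form.
    static byte[] encode(char c) {
        if (c <= 0x007F) {
            // 0xxxxxxx: 7 data bits
            return new byte[] { (byte) c };
        } else if (c <= 0x07FF) {
            // 110xxxxx 10xxxxxx: 5 + 6 = 11 data bits
            return new byte[] {
                (byte) (0xC0 | (c >> 6)),
                (byte) (0x80 | (c & 0x3F))
            };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 = 16 data bits
            return new byte[] {
                (byte) (0xE0 | (c >> 12)),
                (byte) (0x80 | ((c >> 6) & 0x3F)),
                (byte) (0x80 | (c & 0x3F))
            };
        }
    }

    // Decode the bytes of a single well-formed character.
    static char decode(byte[] b) {
        int first = b[0] & 0xFF;
        if ((first & 0x80) == 0) {
            return (char) first;
        } else if ((first & 0xE0) == 0xC0) {
            return (char) (((first & 0x1F) << 6) | (b[1] & 0x3F));
        } else {
            return (char) (((first & 0x0F) << 12)
                    | ((b[1] & 0x3F) << 6)
                    | (b[2] & 0x3F));
        }
    }

    public static void main(String[] args) {
        byte[] twoBytes = encode((char) 0x00E9);   // C3 A9
        byte[] threeBytes = encode((char) 0x20AC); // E2 82 AC
        System.out.println(twoBytes.length + " " + threeBytes.length); // prints "2 3"
    }
}
```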

The UTF-8 encoding has the following desirable features:

All ASCII characters are one-byte UTF-8 characters. A legal ASCII string is a legal UTF-8 string.
Any non-ASCII byte (i.e., any byte with the high-order bit set) is part of a multi-byte character.
The first byte of any UTF-8 character indicates the number of additional bytes in the character.
The first byte of a multi-byte character is easily distinguished from the subsequent bytes. Thus, it is easy to locate the start of a character from an arbitrary position in a data stream.
It is easy to convert between UTF-8 and Unicode.
The UTF-8 encoding is relatively compact. For text with a large percentage of ASCII characters, it is more compact than Unicode. In the worst case, a UTF-8 string is only 50% larger than the corresponding Unicode string.

Java actually uses a slightly modified form of UTF-8. The Unicode character \u0000 is encoded using a two-byte sequence, so that an encoded Unicode string never contains null characters.
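The modified encoding can be observed by writing a string that contains only \u0000 with DataOutputStream.writeUTF() and inspecting the bytes. In the sketch below, the class and helper names (ModifiedUtf8Null, writeUtf) are hypothetical. The first two bytes of the output are the length that writeUTF() writes before the encoded characters.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class ModifiedUtf8Null {
    // Encode a string with writeUTF() and return the raw bytes.
    static byte[] writeUtf(String s) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeUTF(s);
        out.flush();
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // A string holding the single character with value 0
        byte[] data = writeUtf(String.valueOf((char) 0));
        StringBuilder hex = new StringBuilder();
        for (byte b : data) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        // Length prefix 00 02, then the two-byte form C0 80
        System.out.println(hex.toString().trim()); // prints "00 02 C0 80"
    }
}
```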