Document Information Item
The Document Information Item is the root of the information found in an XML document. There is only one such root item.
This information item begins with the ContentHandler.startDocument() call and ends with the ContentHandler.endDocument() call. Many SAX2 event calls are used to construct its children or constituents.
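As a rough illustration of those boundaries, the sketch below records the order in which callbacks arrive for a tiny document. The names `DocumentBoundsDemo`, `TraceHandler`, and `trace` are hypothetical and chosen only for this example; it assumes the JAXP-provided parser bundled with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class DocumentBoundsDemo {
    /** Hypothetical handler: appends one entry per callback, in arrival order. */
    static class TraceHandler extends DefaultHandler {
        final StringBuilder trace = new StringBuilder();

        @Override public void startDocument() {
            trace.append("startDocument;");   // the document item begins here
        }
        @Override public void startElement(String uri, String localName,
                                           String qName, Attributes atts) {
            trace.append("startElement(").append(qName).append(");");
        }
        @Override public void endElement(String uri, String localName, String qName) {
            trace.append("endElement(").append(qName).append(");");
        }
        @Override public void endDocument() {
            trace.append("endDocument;");     // the document item ends here
        }
    }

    /** Parses the given XML text and returns the recorded callback sequence. */
    static String trace(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            TraceHandler handler = new TraceHandler();
            reader.setContentHandler(handler);
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.trace.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trace("<doc><item/></doc>"));
    }
}
```

Every other event in the returned trace falls between the `startDocument` and `endDocument` entries, which is what makes them the natural bounds of the document item.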
Because text in Java is always accessed using UTF-16 character strings or arrays, most applications won't need to worry about encoding issues; the SAX2 parser handles that. However, there are cases when encoding may matter:
- Input normalization
  Some recent XML standards require that text be normalized. For example, XML Canonicalization (as used in digital signature applications) requires the use of Unicode Normalization Form C; some other W3C specifications have the same requirement. Text originally represented in UTF-8 or UTF-16 might need further normalization to remove some deprecated character codes that can be represented using those encodings.
  Such encoding data is needed on a per-entity basis, not on a per-document basis as the Infoset specification implies. For internal entity expansions and defaulted attributes, you'll need to normalize if the encoding associated with the original definition supported denormalized text.
- Output encoding
  When using an output encoding that is not based on the Unicode character set, you may not be able to represent XML names that use particular characters. For example, ASCII cannot handle element or attribute names that use accented characters (used in Europe and Latin America) or ideographic characters (used in Asia).
  The preferred solution is to always use UTF-8 or UTF-16 when outputting XML, so that such problems cannot occur and all XML processors can work with the output. Similar logic applies to display systems such as window systems: prefer font rendering systems that use Unicode over those tied to a specific encoding.
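Both concerns can be checked with standard JDK classes. The following is a minimal sketch, not part of SAX2 itself: `EncodingChecks`, `isNfc`, `toNfc`, and `canWriteName` are hypothetical names, and the sample word is an assumption made for illustration. It uses `java.text.Normalizer` for Normalization Form C and `CharsetEncoder.canEncode` to test whether a name survives a given output encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class EncodingChecks {
    /** True if the text is already in Unicode Normalization Form C. */
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    /** Converts text to NFC; already-normalized text comes back unchanged. */
    static String toNfc(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    /** True if every character of the XML name can be written in the charset. */
    static boolean canWriteName(String name, Charset charset) {
        return charset.newEncoder().canEncode(name);
    }

    public static void main(String[] args) {
        // Decomposed spelling: plain 'e' followed by a combining acute accent.
        String decomposed = "re\u0301sume\u0301";
        // Precomposed spelling: single accented code points, as NFC produces.
        String composed = "r\u00e9sum\u00e9";

        System.out.println(isNfc(decomposed));                              // false
        System.out.println(toNfc(decomposed).equals(composed));             // true
        System.out.println(canWriteName(composed, StandardCharsets.US_ASCII)); // false
        System.out.println(canWriteName(composed, StandardCharsets.UTF_8));    // true
    }
}
```

A serializer could run a check like `canWriteName` before emitting element or attribute names, and either fail early or fall back to UTF-8, since names (unlike character data) cannot be written using character references.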