Exposing DTD Information

SAX2 exposes DTD information through three different interfaces. Part of it is exposed through the LexicalHandler extension interface: the DTD's root element type declaration and boundaries of the various entities. The rest is exposed through two DTD-specific interfaces, presented here.

When you're working with streams of SAX event data, remember that all DTD event data is seen before the document data it describes. This means that if you need it inside the document, you'll need to plan ahead to save the DTD data. It also means that if you need to merge streams of event data, such DTD data may create a problem. Unless you know the DTD data in advance, you'd need to dam up the event stream until all data that needs to go into downstream DTD events is in hand. Only then can you send the events downstream (with the DTD first). Luckily, merging event streams with unknown DTD data isn't common.

DTD information is automatically used inside XML parsers when they parse XML documents. That includes expansion of conditional sections and parameter entities in DTDs, expanding general entities, and normalizing or defaulting attributes. Most DTD validation can be cleanly layered on top of SAX2 since these declaration callbacks provide all the most important information.[20] SAX2 enables application-level processing of DTD constraints; the only internal support it provides for DTDs is a feature flag to expose parser support for validation. When applications need to construct valid documents, they can use DTD information as they make changes, instead of needing to save the document and reparse the whole thing.

[20]The exceptions relate to lexical constraints that should arguably be well-formedness constraints. Entity nesting is supposed to match nesting of grammatical constructs within DTDs; that's a validity constraint. However, the analogous constraint in a document body affects well-formedness instead.

The support for working with DTDs provided by most XML tools is not as good as the support provided by SAX2. For example, DOM Level 2 provides weaker support, and the TRAX support for SAX (java.xml.transform.sax) doesn't support DeclHandler at all.

Note that while a fully featured SAX2 parser will let you re-create the internal subset, it will not let you round-trip any external parameter entities. That's because parameter entities will be expanded. You will not see conditional sections in external PEs, or declarations being built up from parameter entities. Instead, you'll see the actual declarations that apply to your documents. This may help you to understand exactly what a complex DTD is doing.

The DeclHandler Interface

This extension interface is new in SAX2. It's in the org.xml.sax.ext package, which means among other things that it is optional and not all SAX APIs support it. (DefaultHandler is one example of an API that does not.) However, any SAX2 parser that can be bootstrapped with JAXP must support this interface. There is no setDeclHandler() method; bind these handlers to parsers like this:

XMLReader producer = ...; DeclHandler handler = ...; producer.setProperty ("http://xml.org/sax/properties/ declaration-handler",handler); // throws SAXNotSupportedException if parameter isn't a DeclHandler. // throws SAXNotRecognizedException if parser doesn't support it. 

Parsers that support DeclHandler are essential for applications that need to work with declarations of elements and attributes or with parsed entities. DOM requires such support for parsed entities, although even Level 2 hides or ignores element and attribute type data. This interface is the most common way SAX2 exposes type constraints (the primary role of a Document Type Declaration) from DTDs, so if you need to see those constraints, you'll use this handler. It has four API callbacks:

The DTDHandler Interface

The DTDHandler interface was carried unchanged from SAX1 into SAX2 and is primarily useful for applications that work with two specific SGML notions: notations and unparsed entities. Some DTDs, such as XML DocBook, use notations in such traditional roles. DOM also requires such support. Use XMLReader.setDTDHandler() to bind this handler to a parser. You probably won't ever need to use it for new code. On the Web, those SGML notions correspond roughly to MIME types and URIs respectively, web concepts that are much more widely understood and supported. The interface has only two API callbacks, provided to meet specific requirements in the XML 1.0 specification:

Most XML applications that care about unparsed entities and notations do so because they interface with SGML systems that use them or are migrating such systems to use the XML generation of tools. XML editors supporting this functionality might use these event callbacks to create menus of notations or unparsed entities when they are editing attributes that hold such values.

Applications that use this interface will normally use the callbacks to create two tables, keyed by entity or notation name respectively, that are used to interpret element attributes. More rarely, notations will be used to determine the operation corresponding to a given processing instruction target name. Secure applications will never use notations to directly encode system commands, but will always redirect through application controlled tables. For example, it would be foolish to rely on system IDs found in a document. System IDs such as rm -rf /, when run through a Unix or Linux shell, would remove all files accessible through the local system.