Configuring XMLReader Behavior

A configuration mechanism was one of the key features added in the SAX2 release. Parsers can support extensible sets of named Boolean feature flags and property objects. These function in similar ways, including using URIs to identify any number of features and properties. The exception model, presented in "Introducing SAX2" in "SAX2 Feature Flags", is used to distinguish the three basic types of feature or property: the current value may be read-only, read/write, or undefined. Some flags and properties may have rules about when they can be changed (typically not while parsing) or read.

Applications access property objects and feature flags through get*() and set*() methods and use URIs to identify the characteristic of interest. Since SAX does not provide a way to enumerate such URIs as supported by a parser, you will need to rely on parser documentation, or the tables in this section, to identify the legal identifiers. (Or consult the source code, if you have access to it.)

If you happen to be defining new handlers or features using the SAX2 framework, you don't have to ask for permission to define new property or feature flag IDs. Since they are identified using URIs, just start your ID with a base URI that you control. (Only the SAX maintainers would start with the http://xml.org/sax/ URI, for example.) Typically, it will be easiest to make up some HTTP URL based on a fully qualified domain name that you control. As with namespace URIs, these are used purely as identifiers rather than as locations from which data would be retrieved. (The "I" in URI stands for "identifier.")
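As a minimal sketch of defining your own identifier, the filter below adds one application-level property on top of whatever its parent parser supports. The class name AuditFilter, the example.com base URI, and the choice of a StringBuilder value are all assumptions made up for illustration; only XMLFilterImpl and the two SAX exceptions come from SAX2 itself.

```java
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter that recognizes one application-defined property. */
final class AuditFilter extends XMLFilterImpl {
    // Assumption: example.com stands in for a domain that you control.
    static final String AUDIT_LOG_PROPERTY =
        "http://example.com/sax/properties/audit-log";

    private StringBuilder auditLog;

    @Override
    public void setProperty(String name, Object value)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (AUDIT_LOG_PROPERTY.equals(name)) {
            // Recognized ID, wrong value type: "not supported".
            if (!(value instanceof StringBuilder)) {
                throw new SAXNotSupportedException(
                    "Expected a StringBuilder for " + name);
            }
            auditLog = (StringBuilder) value;
        } else {
            // Anything else is delegated to the parent reader, if any.
            super.setProperty(name, value);
        }
    }

    @Override
    public Object getProperty(String name)
            throws SAXNotRecognizedException, SAXNotSupportedException {
        if (AUDIT_LOG_PROPERTY.equals(name)) {
            return auditLog;
        }
        return super.getProperty(name);
    }
}
```

Because the ID starts with a URI outside http://xml.org/sax/, it cannot collide with a standard property, and readers that have never heard of it simply report it as unrecognized.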

XMLReader Properties

SAX2 defines two XMLReader calls for accessing named property objects. One of the most common uses for such objects is to install non-core event handlers. Accessing properties is like accessing feature flags, except that the values associated with these names are objects rather than Booleans:

XMLReader producer ...;
String    uri = ...;
Object    value = ...;

// Try getting and setting the property
try {
    System.out.println ("Initial property setting: "
        + producer.getProperty (uri));
    // if we get here, the property is supported
    producer.setProperty (uri, value);
    // if we get here, the parser set the property
}
catch (SAXNotSupportedException e) {
    // bad value for property ... maybe wrong type, or parser state
    System.out.println ("Can't set property: " + e.getMessage ());
    System.exit (1);
}
catch (SAXNotRecognizedException e) {
    // property not supported by this parser
    System.out.println ("Doesn't understand property: " + e.getMessage ());
    System.exit (1);
}

You'll notice the URIs for these standard properties happen to have a common prefix. This means that you can declare the prefix (http://xml.org/sax/properties/) as a constant string and construct the identifiers by string catenation.
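One way to apply that advice is a small constants holder like the sketch below. The class name SaxProperties is an assumption; the four identifiers are the standard property URIs described in this section.

```java
/** Identifiers for the standard SAX2 properties, built from the shared prefix. */
final class SaxProperties {
    static final String PREFIX = "http://xml.org/sax/properties/";

    static final String DECLARATION_HANDLER = PREFIX + "declaration-handler";
    static final String DOM_NODE            = PREFIX + "dom-node";
    static final String LEXICAL_HANDLER     = PREFIX + "lexical-handler";
    static final String XML_STRING          = PREFIX + "xml-string";

    private SaxProperties() {
        // constants only; not meant to be instantiated
    }
}
```

Spelling the prefix once means a typo in it shows up everywhere at once instead of hiding in a single identifier.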

Here are the standard properties:

http://xml.org/sax/properties/declaration-handler
This property holds an implementation of org.xml.sax.ext.DeclHandler, used for reporting the DTD declarations that aren't reported through org.xml.sax.DTDHandler callbacks or, for the root element name declaration, through org.xml.sax.ext.LexicalHandler callbacks. This handler is presented in "The DeclHandler Interface".
Ælfred, Crimson, and Xerces support this property. In fact, all JAXP-compliant processors must do so.
http://xml.org/sax/properties/dom-node
Only specialized parsers will support this property: parsers that traverse DOM document nodes to produce streams of corresponding SAX events. (Typical SAX2 parsers parse XML text instead of DOM content.) When read, this property returns the DOM node corresponding to the current SAX2 callback. The property can only be written before a parse, to specify that the DOM node beginning and ending the SAX event stream need not be an org.w3c.dom.Document. This type of parser is presented later in this chapter, in "DOM-to-SAX Event Production (and DOM4J, JDOM)".
One example of such a parser is gnu.xml.util.DomParser, which is currently packaged along with the Ælfred parser. At this time, neither Crimson nor Xerces includes such functionality.
http://xml.org/sax/properties/lexical-handler
This property holds an implementation of org.xml.sax.ext.LexicalHandler, used for reporting various events mostly (but not exclusively) relating to details of XML text that have no semantic or structural meaning, such as comments. This handler is presented in "Consuming SAX2 Events" in "The LexicalHandler Interface".
Ælfred, Crimson, and Xerces support this property. In fact, all JAXP-compliant processors must do so.
http://xml.org/sax/properties/xml-string
This property returns a literal string of characters associated with the current parser callback event. Exactly which characters are returned isn't specified by SAX2. An example would be returning all the characters in the start tag of an element, including unexpanded entity and character references as well as excess whitespace and the exact type of quote characters (single, double) used to delimit attribute values. (This feature is intended to be of use when constructing certain kinds of XML editors, or DTD analyzers, that are willing to re-parse this data.)
No widely available open source SAX2 parser currently supports this property.
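The two handler properties above can be exercised with a short sketch like the following. It assumes the parser bundled with the JDK (obtained through JAXP) rather than any of the parsers named in this section, and the class and method names (ExtensionHandlerDemo, Recorder, parse) are hypothetical. DefaultHandler2 is used only because it conveniently implements both LexicalHandler and DeclHandler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

final class ExtensionHandlerDemo {
    static final String PROPERTIES = "http://xml.org/sax/properties/";

    /** Records comments and element declarations; ignores everything else. */
    static final class Recorder extends DefaultHandler2 {
        final List<String> comments = new ArrayList<>();
        final List<String> declaredElements = new ArrayList<>();

        @Override
        public void comment(char[] ch, int start, int length) {
            // LexicalHandler callback
            comments.add(new String(ch, start, length));
        }

        @Override
        public void elementDecl(String name, String model) {
            // DeclHandler callback
            declaredElements.add(name);
        }
    }

    /** Installs the recorder through the two standard properties, then parses. */
    static Recorder parse(String xml) throws Exception {
        XMLReader producer =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        Recorder recorder = new Recorder();
        producer.setContentHandler(recorder);
        producer.setProperty(PROPERTIES + "lexical-handler", recorder);
        producer.setProperty(PROPERTIES + "declaration-handler", recorder);
        producer.parse(new InputSource(new StringReader(xml)));
        return recorder;
    }

    public static void main(String[] args) throws Exception {
        Recorder r = parse(
            "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>]><doc><!--note-->text</doc>");
        System.out.println("comments: " + r.comments);
        System.out.println("declared elements: " + r.declaredElements);
    }
}
```

If either setProperty() call throws, the parser does not support that handler, and the application has to decide whether it can live without those events.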

Applications may find it useful to define their own types of handler interfaces, assembling sequences of SAX event "atoms" into higher-level event "molecules" that incorporate essential application-level semantics (and probably some procedural validation). This is the same kind of process model used by W3C's XML schema processing model: the Post-Schema-Validation Infoset (PSVI) additions incorporate semantics suited to processing with that kind of schema. Most applications need to associate even more semantics with data than are easily captured by such simple rules (including DTDs and all types of schema). Those semantics would likely not be understood by any common XMLReader, but other kinds of SAX processing components can help manage such application-level handlers. You can see an example of this technique in Example 6-3.

XMLReader Feature Flags

The previous chapter showed how to access feature flags from SAX parsers and used the standard validation flag as the primary example. Accessing feature flags follows the same model as accessing properties, except that the values are boolean, not Object. There are a handful of standard SAX2 feature flags, which are all you normally need. The namespace for features is different from the namespace for properties. You can't set a property to a java.lang.Boolean value and expect to have the same effect as setting the feature flag that happens to use the same identifier.

As with properties, the URIs for these standard feature flags happen to have a common prefix: http://xml.org/sax/features/. It's good development practice to declare the prefix as a constant and construct these feature identifiers by string catenation, which helps reduce errors. Also, remember that flags aren't necessarily either settable (read/write)[17] or readable (supported); some parsers won't recognize all these flags, and in some cases these flags expose parser behaviors that don't change.

[17] SAX could support write-only flags too, but these are rarely a good idea.
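Since a flag may be undefined, read-only, or read/write for a given parser, it can help to probe before relying on it. The following is a rough sketch of such a probe, not part of SAX itself: the FeatureProbe class, its Access enum, and the probe() method are hypothetical names, and the JDK's bundled parser is assumed in main(). The probe briefly flips the flag, so it should only be used before parsing starts.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

final class FeatureProbe {
    enum Access { UNDEFINED, READ_ONLY, READ_WRITE }

    /** Classifies a feature flag by reading it and then trying to flip it. */
    static Access probe(XMLReader reader, String uri) {
        boolean current;
        try {
            current = reader.getFeature(uri);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // Unknown to this parser, or not readable in its current state.
            return Access.UNDEFINED;
        }
        try {
            reader.setFeature(uri, !current);
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // Readable, but the parser refuses to change it.
            return Access.READ_ONLY;
        }
        try {
            reader.setFeature(uri, current); // put the original value back
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            // Could not restore; the flag stays flipped.
        }
        return Access.READ_WRITE;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String prefix = "http://xml.org/sax/features/";
        for (String name : new String[] {
                "namespaces", "namespace-prefixes", "validation",
                "string-interning", "external-general-entities" }) {
            System.out.println(name + ": " + probe(reader, prefix + name));
        }
    }
}
```

The results are parser-specific and often version-specific, which is exactly why the tables in this section describe each parser separately.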

The standard flags are as follows:

http://xml.org/sax/features/external-general-entities
The default value for this flag is parser-specific. When the parser is validating, and in most other cases, the flag is true, indicating that the parser reads all external entities used outside the DTD. When the flag is false, the XML parser won't expand references to external general entities, so applications won't see the entire body of documents using such entities. This value can't be changed during parsing.
Crimson and Xerces only support true for this flag. (For such parsers, you can get most of the effect of setting this flag to false by using an EntityResolver that returns zero-length entities after the first startElement() event.) Ælfred supports changing the value of this flag.
http://xml.org/sax/features/external-parameter-entities
The default value for this flag is parser-specific. When the parser is validating, and in most other cases, the flag is true, indicating the DTD will be completely processed. When the flag is false, the XML parser will skip any external DTD subset, as well as named external parameter entities, so it won't necessarily read the entire DTD for a document. This value can't be changed during parsing.
Skipping these entities means attributes declared in them will not be defaulted or normalized as expected, and their types won't be known. As a result, default namespace declarations may get dropped. Parts of the internal subset after a reference to a skipped external parameter entity will be ignored. It also means some general entities might not be declared, making it impossible to correctly distinguish whether references to undefined entities are well-formedness errors.
Normally, you are better off providing an entity resolver that accesses locally cached copies of your DTD components, or not using DTDs, rather than disabling processing of external parameter entities. But don't assume all the XML you work with will have these DTD entities processed; the XML processors in some web browsers will not read these entities by default.
Xerces and Crimson only support true for this flag. (For such parsers, you can get an effect similar to setting this to false by using an EntityResolver that returns zero-length entities before the first startElement() event. The parser won't correctly ignore declarations found later in the DTD.) Ælfred supports changing the value of this flag.
http://xml.org/sax/features/is-standalone
This feature flag derives its value from the document being parsed, so it is read-only and only available after the first part of the document has been parsed. When the flag is true, the document has been declared to be standalone. If that declaration is correct, then all external entities may be safely ignored. This feature is part of XML 1.0 and is intended to reduce the cost of parsing some documents.
This flag should be part of an upcoming SAX extensions release.
http://xml.org/sax/features/lexical-handler/parameter-entities
The default value for this flag is parser-specific and is implicitly false if the parser doesn't support the LexicalHandler through a parser property. When the flag is true, the parser will report the beginning and end of parameter entities through LexicalHandler calls. (Skipped parameter entities are always reported, through the appropriate ContentHandler call.) Parameter entities are distinguished from general entities because the first character of their entity name will be a percent sign (%). The value can't be changed during parsing.
Currently, only the Ælfred parser reports parameter entities.
http://xml.org/sax/features/namespaces
This flag defaults to true in XML parsers, which indicates the parser performs namespace processing, reporting xmlns attributes by startPrefixMapping() and endPrefixMapping() calls and providing namespace URIs for each element or attribute. Otherwise no such processing is done at the parser level. This can't be changed during parsing.
You will leave this flag at its default setting unless your XML documents aren't guaranteed to conform to the XML Namespaces specification. Setting this to false usually gives some degree of parsing speed improvement, although it will likely not provide a significant impact on overall application performance. If you disable namespaces, make sure you first enable the namespace-prefixes feature.
This is supported by all SAX2 XML parsers. Ælfred, Crimson, and Xerces support changing the value of this flag.
http://xml.org/sax/features/namespace-prefixes
This flag defaults to false in XML parsers, indicating the parser will not present xmlns* attributes in its startElement() callbacks. Unless the flag is true, parsers won't portably present the qualified names (which include the prefix) used in an XML document for elements or attributes. The value can't be changed during parsing.
If you want to see the namespace prefixes for any reason, including for generating output without further postprocessing or for performing layered DTD validation, make sure this flag is set. Also make sure this flag is set if you completely disable namespace processing (with the namespaces feature flag), because otherwise the behavior of a SAX2 parser is undefined.
This is supported by all SAX2 parsers. Ælfred, Crimson, and Xerces support changing the value of this flag.
http://xml.org/sax/features/string-interning
The default value for this flag is parser-specific. When true, this indicates that all XML name strings (except those inside attribute values) and namespace URIs returned by this parser will have been interned using String.intern(). Some kind of interning is almost always done to improve the performance of parsers, and this flag exposes this work for the benefit of applications. This value can't be changed during parsing.
When applications know interning has been done, they know they can rely on fast, identity-based tests for string equality (== or !=) rather than the more expensive String.equals() method. Using equality testing for strings will always work, but it can be much slower than identity testing. Java automatically interns all string constants. Lots of startElement() processing needs to match element and attribute name strings (as sketched in Example 2-8), so this kind of optimization can often be a win.
Ælfred interns all strings. Some older versions of Crimson don't recognize this flag, but all versions should correctly intern those strings. Xerces reports that it does not intern these strings.
http://xml.org/sax/features/validation
The default value for this flag is parser-specific; in most cases it is false. When the flag is true, the parser is performing XML validation (with a DTD, unless you've requested otherwise). When the flag is false, the parser isn't validating. The value can't be changed while parsing.
Ælfred, when packaged with its optional validator, Crimson, and Xerces support both settings. Ælfred also includes a nonvalidating parser, which supports only false for this flag.

A few additional standard extension features will likely be defined, providing even more complete Infoset support from SAX2 XML parsers.
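To see what the namespaces and namespace-prefixes flags described above change in practice, here is a small sketch that parses the same document with namespace-prefixes off and then on. It assumes the JDK's bundled parser, and the names NamespaceFlagsDemo and rootAttributeNames are hypothetical. Only the attribute names reported for the root element are collected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class NamespaceFlagsDemo {
    static final String FEATURES = "http://xml.org/sax/features/";

    /** Returns the qualified names of the attributes reported on the root element. */
    static List<String> rootAttributeNames(String xml, boolean namespacePrefixes)
            throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Both flags must be set before parsing begins.
        reader.setFeature(FEATURES + "namespaces", true);
        reader.setFeature(FEATURES + "namespace-prefixes", namespacePrefixes);

        List<String> names = new ArrayList<>();
        reader.setContentHandler(new DefaultHandler() {
            private boolean seenRoot;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (seenRoot) {
                    return;
                }
                seenRoot = true;
                for (int i = 0; i < atts.getLength(); i++) {
                    names.add(atts.getQName(i));
                }
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<p:doc xmlns:p='urn:example'/>";
        System.out.println("namespace-prefixes=false: "
            + rootAttributeNames(xml, false));
        System.out.println("namespace-prefixes=true:  "
            + rootAttributeNames(xml, true));
    }
}
```

With the flag off, the xmlns:p declaration reaches the application only through startPrefixMapping(); with it on, the declaration also appears as an ordinary attribute in startElement().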

Of the widely available parsers, only Xerces has nonstandard feature flags. (The Xerces distribution includes full documentation for those flags.) As a rule, avoid most of these, because they are parser-specific and even version-specific. Some are used to disable warnings about extra definitions that aren't errors. (Most parsers don't bother reporting such nonerrors; Xerces reports them by default.) Others promote noncompliant XML validation semantics. Here are a few flags that you may want to use.

http://apache.org/xml/features/validation/schema
This tells the parser to validate with W3C-style schemas. The document needs to identify a schema, and the parser must have namespaces and validation enabled. (Defaults to false.)
W3C XML schema validation does not need to be built into XML parsers. In fact, most currently available schema validators are layered.
http://apache.org/xml/features/validation/schema-full-checking
This flag controls whether W3C schema validation involves all the specified tests. By default, some of the more expensive checks are not performed; Xerces is not "fully conforming" by default.
http://apache.org/xml/features/allow-java-encodings
This flag defaults to false, limiting the encodings that the parser accepts to a handful. When the flag is set to true, more encoding names are supported. Most other SAX2 parsers effectively have true as their default. A few of those additional encoding names are Java-specific (such as "UTF8"); most of them are standard encoding names, either the preferred version or recognized alternatives.
http://apache.org/xml/features/continue-after-fatal-error
When set, this flag permits Xerces to continue parsing after it invokes ErrorHandler.fatalError() to report a nonrecoverable error. If the error handler doesn't abort parsing by throwing an exception, Xerces will continue. The XML specification requires that no more event data be reported after fatal errors, but it allows additional errors to be reported. (Of course, depending on the initial error, many of the subsequent reports might be nonsense.)
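Because these flags are parser-specific, code that uses them is easier to keep portable if a refusal is treated as a normal outcome rather than a failure. The helper below is one possible sketch of that approach. OptionalFlags and setIfSupported are hypothetical names, the JDK's bundled parser is assumed in main(), and whether any particular parser accepts the Xerces identifiers shown is not guaranteed.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

final class OptionalFlags {
    /** Tries to set a parser-specific flag; returns false if the parser refuses. */
    static boolean setIfSupported(XMLReader reader, String uri, boolean value) {
        try {
            reader.setFeature(uri, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader =
            SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String[] wanted = {
            "http://apache.org/xml/features/validation/schema",
            "http://apache.org/xml/features/validation/schema-full-checking",
            "http://apache.org/xml/features/continue-after-fatal-error",
        };
        for (String uri : wanted) {
            System.out.println(uri + " accepted: "
                + setIfSupported(reader, uri, true));
        }
    }
}
```

An application that truly depends on one of these behaviors should check the return value and fail early, instead of discovering the missing behavior partway through a parse.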