DOM Level 3 Modules

DOM Level 3 seems to be the point at which the specification maintainers got very practical about the API. While modules like Traversal, Range, Events, and HTML are nice, they're not the sorts of things you'll find yourself using every day, at least in most coding environments. However, the ability to validate a DOM tree in memory, and to write out XML documents, is something you're likely to need every hour, not just once in a while. Fortunately, these key improvements are being adopted fairly quickly, so expect widespread DOM Level 3 support within the next year.

Load and Save

Personally, I'm as excited about the Load and Save module as I am about anything that has come out of DOM since the first version of this tutorial came out seven years ago. In short, this module allows you to excise the following line from your code, once and for all:

import org.apache.xerces.parsers.DOMParser;

Now, I'm as much a fan of Xerces as anyone, but I just don't like vendor-specific code in my classes. I'd much rather configure code with system properties, and be able to change parsers, processors, and the like all on the fly. Load and Save (LS) fills this need nicely.

Reading XML documents

There are quite a few classes involved in loading a DOM tree; they're all in the org.w3c.dom.ls package, as the following figure shows.

Figure: DOM Load and Save module

I'm not going to cover every nuance of each of these, but instead will focus on what most of you care about: loading a DOM tree without using Xerces (or some other parser) directly. First, obtain an instance of org.w3c.dom.bootstrap.DOMImplementationRegistry; this class was covered in the last chapter and is critical for using the LS module:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance( );

Remember, you can request DOM implementations from this registry via the getDOMImplementation( ) method; just use the "LS" string to get an LS-capable implementation.

DOMImplementation impl = registry.getDOMImplementation("LS");

You also need to cast the returned object (a DOMImplementation) to the LS-specific type, org.w3c.dom.ls.DOMImplementationLS, so you can access its methods:

DOMImplementationLS lsImpl =
 (DOMImplementationLS)registry.getDOMImplementation("LS");

So far, this is easy. Now, though, you need to create a new DOM parser; this is done via the createLSParser( ) method, available on DOMImplementationLS. You must supply this method with two arguments: a mode (either DOMImplementationLS.MODE_SYNCHRONOUS or DOMImplementationLS.MODE_ASYNCHRONOUS), and a schema type:

LSParser parser = lsImpl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS,
 null);

If you're not sure what type of schema you'll use, or just want a little bit of extra flexibility, use null for this second argument.
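Put together, the steps so far fit in one small class. This is only a sketch: the class name LsBootstrap and its helper methods are my own, and it assumes your runtime ships a DOM implementation that advertises the "LS" feature.

```java
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSParser;

// Hypothetical helper class; only the org.w3c.dom types come from the DOM API.
class LsBootstrap {

    /** Finds an LS-capable DOM implementation through the registry. */
    static DOMImplementationLS lsImplementation() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            // getDOMImplementation() returns null when nothing matches the feature string
            DOMImplementationLS lsImpl =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
            if (lsImpl == null) {
                throw new IllegalStateException("No LS-capable DOM implementation found");
            }
            return lsImpl;
        } catch (ReflectiveOperationException e) {
            // newInstance() loads the implementation sources reflectively
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    /** Creates a synchronous parser without committing to a schema type. */
    static LSParser newParser() {
        return lsImplementation().createLSParser(
            DOMImplementationLS.MODE_SYNCHRONOUS, null);
    }

    public static void main(String[] args) {
        LSParser parser = newParser();
        System.out.println("Parser class: " + parser.getClass().getName());
        System.out.println("Asynchronous: " + parser.getAsync());
    }
}
```

Printing the parser's class name is a quick way to see which vendor you actually ended up with, without ever importing that vendor's classes.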

Compatibility All Over the Place

Even though Xerces supports the DOM Level 3 core and LS module, it still doesn't provide asynchronous parsing. And, of course, you won't even get this far if your parser isn't DOM3-compatible. Be sure you use the hasFeature( ) method (detailed in Example 6-1) with "XML" and "3.0" before trying to compile or test any of this code. Additionally, you can test for asynchronous support in the LS module with "LS-Async" and "3.0".
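If you don't have the earlier feature-checking example handy, a check along these lines should do. It's a sketch under assumptions: the class name LsFeatureCheck is mine, and it queries whichever implementation the registry returns for "XML", which may not be the one you end up using if several are installed.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

// Hypothetical helper class for illustration.
class LsFeatureCheck {

    /** Returns true if the registry's XML-capable implementation claims the feature. */
    static boolean supports(String feature, String version) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementation impl = registry.getDOMImplementation("XML");
            return impl != null && impl.hasFeature(feature, version);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    public static void main(String[] args) {
        // The three feature strings mentioned in the text above
        for (String feature : new String[] {"XML", "LS", "LS-Async"}) {
            System.out.println(feature + " 3.0: " + supports(feature, "3.0"));
        }
    }
}
```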

In SAX parsing and JAXP parsing (which we'll discuss in the next chapter), you'd set features and properties to determine how the parser works: should it validate? What about handling errors? In the LS module, though, these options are handled by the org.w3c.dom.DOMConfiguration interface (another new DOM 3 type, this one in the core module). Here's the interface definition:

package org.w3c.dom;
public interface DOMConfiguration {
 public void setParameter(String name, Object value)
 throws DOMException;
 public Object getParameter(String name)
 throws DOMException;
 public boolean canSetParameter(String name, Object value);
 public DOMStringList getParameterNames( );
}

Pretty simple, right? You can work with this object like this:

// Set options on the parser
DOMConfiguration config = parser.getDomConfig( );
config.setParameter("validate", Boolean.TRUE);

The parameter is called "validate", not "validation"; if you mistype the name, setParameter( ) throws a DOMException (code NOT_FOUND_ERR, which Xerces reports with a FEATURE_NOT_FOUND message).
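Because DOMConfiguration also offers canSetParameter( ), you can probe a name before committing to it. The sketch below wraps that in a helper; SafeConfig, trySet( ), and newParserConfig( ) are names I made up, and how an unknown name is rejected (a false from canSetParameter( ) or an exception from setParameter( )) is left to the implementation, so the helper handles both.

```java
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMException;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;

// Hypothetical helper class for illustration.
class SafeConfig {

    /** Sets the parameter if the implementation accepts it; reports whether it did. */
    static boolean trySet(DOMConfiguration config, String name, Object value) {
        if (!config.canSetParameter(name, value)) {
            return false;                     // unknown name or unsupported value
        }
        try {
            config.setParameter(name, value);
            return true;
        } catch (DOMException e) {
            return false;                     // rejected despite the earlier probe
        }
    }

    /** Returns the configuration of a freshly created synchronous parser. */
    static DOMConfiguration newParserConfig() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
            return ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null)
                     .getDomConfig();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    public static void main(String[] args) {
        DOMConfiguration config = newParserConfig();
        System.out.println("validate:   " + trySet(config, "validate", Boolean.TRUE));
        System.out.println("validation: " + trySet(config, "validation", Boolean.TRUE));
    }
}
```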

There are a lot more options, ranging from setting an error handler to namespace handling. You can read about the complete list, as defined in the DOM specification, online at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#DOMConfiguration.

The relationships between your validation setting, normalization settings, and the type of schema being used are detailed at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/configuration-settings.html. This is required reading if you're going to validate using the LS module.

Finally, you can parse XML using the LSParser:

Document doc = parser.parseURI(args[0]);

When you run this code, you need to tell the DOMImplementationRegistry which DOMImplementationSource to use. You can do this in your code:

System.setProperty(DOMImplementationRegistry.PROPERTY,
 "org.apache.xerces.dom.DOMXSImplementationSourceImpl");

or by setting the same system property on the command line:

java -Dorg.w3c.dom.DOMImplementationSourceList=org.apache.xerces.dom.DOMXSImplementationSourceImpl
 // rest of command line options and class name...

If you use the code version to set this property, make sure it's in some sort of init( ) method, or called before any other DOM Level 3 parsing code.

Using this code to parse the XHTML from Ajaxian.com (mentioned earlier, in the "Traversal" section), you'd get a validated DOM tree. If, however, there are validation errors, they will be reported:

[Error] Ajaxian-05242005.xhtml:573:48: Attribute "alt" is required and must be specified for element type "img".
[Error] Ajaxian-05242005.xhtml:577:14: Attribute "alt" is required and must be specified for element type "img".

You can also specify both the type of constraints to use, and the location of those constraints, via the setParameter( ) method on DOMConfiguration:

config.setParameter("schema-type", "http://www.w3.org/2001/XMLSchema");
config.setParameter("schema-location", "dw-document-4.0.xsd");

Use a value of "http://www.w3.org/TR/REC-xml" for the schema-type parameter to validate against DTDs.
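To see the pieces working together, here is a sketch that validates a small document against a DTD in its internal subset and collects the problems through the "error-handler" parameter. The class name DtdValidationSketch and the sample documents are mine; it also uses LSInput (created with createLSInput( ) on the LS implementation) so the XML can come from a String, and it sets schema-type before validate so that the parser knows which kind of constraints to apply.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMError;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;

// Hypothetical class for illustration.
class DtdValidationSketch {

    /** Parses the XML with DTD validation on and returns the reported problems. */
    static List<String> validate(String xml) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                (DOMImplementationLS) registry.getDOMImplementation("LS");
            LSParser parser =
                ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);

            final List<String> problems = new ArrayList<>();
            DOMErrorHandler handler = error -> {
                if (error.getSeverity() != DOMError.SEVERITY_WARNING) {
                    problems.add(error.getMessage());
                }
                return true;                  // keep parsing after the error
            };

            DOMConfiguration config = parser.getDomConfig();
            config.setParameter("error-handler", handler);
            config.setParameter("schema-type", "http://www.w3.org/TR/REC-xml");
            config.setParameter("validate", Boolean.TRUE);

            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            parser.parse(input);
            return problems;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    public static void main(String[] args) {
        String dtd = "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
        System.out.println(validate(dtd + "<note><to>Ann</to></note>"));
        System.out.println(validate(dtd + "<note><from>Bob</from></note>"));
    }
}
```

Returning true from the handler asks the parser to carry on, so you get every validity error in one pass rather than just the first.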

This should all look a lot like what you used to do with Xerces's DOMParser; now, though, you have untethered your code from that specific parser, which is the whole point of the LS module. You can also use the parse( ) method, which takes as input an LSInput object. LSInput offers a little more flexibility in terms of the types of input you can use for XML:

LSInput input = lsImpl.createLSInput( );
input.setCharacterStream(new FileReader(new File(args[0])));
Document doc = parser.parse(input);

You can also supply LSInput a String (when you have your XML all in one glob) or an InputStream; you can also explicitly set the encoding of LSInput, adding even more flexibility. For simple apps, though, parseURI( ) works just fine.
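Here is a sketch of those other LSInput options, using setStringData( ), setByteStream( ), and setEncoding( ). The class LsInputVariants and its method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;

// Hypothetical class for illustration.
class LsInputVariants {

    private static DOMImplementationLS ls() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            return (DOMImplementationLS) registry.getDOMImplementation("LS");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    private static String parseRootName(DOMImplementationLS ls, LSInput input) {
        Document doc = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null)
                         .parse(input);
        return doc.getDocumentElement().getNodeName();
    }

    /** Parses XML held entirely in a String and returns the root element's name. */
    static String rootFromString(String xml) {
        DOMImplementationLS ls = ls();
        LSInput input = ls.createLSInput();
        input.setStringData(xml);
        return parseRootName(ls, input);
    }

    /** Parses XML from raw bytes, stating the encoding explicitly. */
    static String rootFromBytes(byte[] bytes, String encoding) {
        DOMImplementationLS ls = ls();
        LSInput input = ls.createLSInput();
        input.setByteStream(new ByteArrayInputStream(bytes));
        input.setEncoding(encoding);
        return parseRootName(ls, input);
    }

    public static void main(String[] args) {
        System.out.println(rootFromString("<greeting>hi</greeting>"));
        byte[] latin1 = "<note/>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(rootFromBytes(latin1, "ISO-8859-1"));
    }
}
```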

Writing XML documents

Once you understand how loading documents works, saving them is almost trivial to describe. Here's a code fragment that handles that task:

Document doc = parser.parseURI(args[0]);
/* Perform whatever operations on the DOM tree you want */

/* Serialize the document */
LSSerializer serializer = lsImpl.createLSSerializer( );
LSOutput output = lsImpl.createLSOutput( );
output.setCharacterStream(new FileWriter(new File(args[1])));
serializer.write(doc, output);

LSSerializer is the saving equivalent of LSParser, and is obtained the same way: via a factory method on your DOMImplementationLS object. In the same fashion, LSOutput complements LSInput and can accept an OutputStream, Writer, or system ID to which to write. With those two classes in hand, there's not much left to add, other than mentioning some of the options you may want to set for output (again, using the DOMConfiguration object); here are a few samples:

/* Serialize the document */
LSSerializer serializer = lsImpl.createLSSerializer( );
// Set output options
config = serializer.getDomConfig( );
// Convert CDATA sections to normal escaped text
config.setParameter("cdata-sections", false);
// Remove any comments
config.setParameter("comments", false);
LSOutput output = lsImpl.createLSOutput( );
output.setCharacterStream(new FileWriter(new File(args[1])));
serializer.write(doc, output);

As mentioned in the section on "Reading XML documents," the complete list of options is online at http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#DOMConfiguration.
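If you'd rather see the result than write it to a file, LSSerializer also has a writeToString( ) method. The sketch below round-trips a String through the parser and serializer; the class LsWriteSketch is my own invention, and I turn off the "xml-declaration" parameter only to keep the output short.

```java
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical class for illustration.
class LsWriteSketch {

    /** Parses the XML, then serializes it again with or without its comments. */
    static String roundTrip(String xml, boolean keepComments) {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementationLS ls =
                (DOMImplementationLS) registry.getDOMImplementation("LS");

            // Load: the parser keeps comments in the tree by default
            LSInput input = ls.createLSInput();
            input.setStringData(xml);
            Document doc = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null)
                             .parse(input);

            // Save: output options live on the serializer's own DOMConfiguration
            LSSerializer serializer = ls.createLSSerializer();
            DOMConfiguration config = serializer.getDomConfig();
            config.setParameter("comments", Boolean.valueOf(keepComments));
            config.setParameter("xml-declaration", Boolean.FALSE);
            return serializer.writeToString(doc);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><!--note--><b>x</b></a>";
        System.out.println(roundTrip(xml, true));
        System.out.println(roundTrip(xml, false));
    }
}
```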

Validation

When it comes to the Validation module, I'm in the unfortunate position of having to demonstrate how it should work. Even Xerces, which is about as bleeding edge as parsers get, doesn't provide support for the Validation module as of this writing (Xerces 2.8.0). So, take the instructions and code in this section as guidelines, and realize that some things may change even as this tutorial goes to press. The classes that make up this module are all stored in the org.w3c.dom.validation package, as the following figure shows. There is one exception class and just four interfaces, so it's not much in terms of additional API to master.

Figure: DOM Validation module

To see if your parser supports validation, use the hasFeature( ) method with the strings "Validation" and "3.0" (or simply run DOMModuleChecker, shown back in Example 6-1).

Validation depends on the core DOM Level 3 module, and uses some of DOM Level 3's new constructs; that's no big deal, though, as obviously any parser that supports validation will surely support the core of DOM Level 3.
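Since the org.w3c.dom.validation interfaces may not even be on your classpath, it can be worth probing for them at runtime before writing code that depends on them. This is a sketch of one way to do that; the class ValidationProbe and its methods are hypothetical, and the reflective lookup is my own addition to the hasFeature( ) check described above.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

// Hypothetical class for illustration.
class ValidationProbe {

    /** True if the Validation module's interfaces can be loaded at all. */
    static boolean interfacesOnClasspath() {
        try {
            Class.forName("org.w3c.dom.validation.DocumentEditVAL");
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** True if the registry's XML implementation claims Validation 3.0. */
    static boolean implementationClaimsValidation() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            DOMImplementation impl = registry.getDOMImplementation("XML");
            return impl != null && impl.hasFeature("Validation", "3.0");
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    static String describe() {
        return "interfaces on classpath: " + interfacesOnClasspath()
             + ", implementation claims support: " + implementationClaimsValidation();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```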

Node types supporting validation

In validation-compatible parsers, the four validation interfaces will be implemented by their corresponding core DOM counterparts:

org.w3c.dom.validation.NodeEditVAL by org.w3c.dom.Node
org.w3c.dom.validation.DocumentEditVAL by org.w3c.dom.Document
org.w3c.dom.validation.ElementEditVAL by org.w3c.dom.Element
org.w3c.dom.validation.CharacterDataEditVAL by org.w3c.dom.CharacterData

This works much in the same way that the Document interface in DOM might implement DocumentTraversal for traversal support or DocumentView for view support. The CharacterData interface (which may be unfamiliar to some of you) is extended by the Text and Comment node types, and then CDATASection extends Text, so you actually get a three-for-one on support for validation via the CharacterDataEditVAL interface. Putting this all together, the following node types all are affected by the Validation module:

Node
Document
Element
Text
Comment
CDATASection

Although comments aren't usually something that you worry about with validation, you should see that these other node types make up the whole of what validation concerns itself with (as opposed to, for example, the DocumentType or ProcessingInstruction node types). Of course, even these are tangentially affected, as they extend the core Node interface, which in turn extends NodeEditVAL.

Enforcing validity as you work

The method that most developers will immediately grab for in validation-compliant parsers is setContinuousValidityChecking( ), on the DocumentEditVAL interface (which would make it available on the Document interface in validation-compliant parsers). By turning this feature on, you ensure that nothing can be added to your DOM tree that would make the tree invalid:

DocumentEditVAL doc = (DocumentEditVAL) getDocument( );
// Ensure the DOM tree always stays valid
doc.setContinuousValidityChecking(true);

You should be aware of some subtle side effects of this setting, though. For example, consider how you might build up a portion of a DOM tree:

Element name = doc.createElement("name");
rootElement.appendChild(name);
Element firstName = doc.createElement("firstName");
Element lastName = doc.createElement("lastName");
name.appendChild(firstName);
name.appendChild(lastName);
Text firstNameText = doc.createTextNode("Pete");
Text lastNameText = doc.createTextNode("Huttlinger");
firstName.appendChild(firstNameText);
lastName.appendChild(lastNameText);

While this code would work in DOM Level 2 (or DOM Level 3 without validation), it would not work in cases where continuous validity checking was turned on (and there was any sort of useful schema in place, of course). For example, you would expect a schema to require that a name element have a firstName and lastName; but when the name element is added to the DOM tree, it has no children. The firstName and lastName children are added after name has been inserted into the DOM tree. As a result, you'd get a validity exception (ExceptionVAL) when you tried to execute this line:

rootElement.appendChild(name);

To avoid this problem, you have to build any fragments you want to add to the tree in their entirety, and then add them to your tree. In this example, that's a fairly trivial change:

Element name = doc.createElement("name");
Element firstName = doc.createElement("firstName");
Element lastName = doc.createElement("lastName");
Text firstNameText = doc.createTextNode("Pete");
Text lastNameText = doc.createTextNode("Huttlinger");
firstName.appendChild(firstNameText);
lastName.appendChild(lastNameText);
name.appendChild(firstName);
name.appendChild(lastName);
// Add the completed subtree to the DOM tree last
rootElement.appendChild(name);

However, in large documents where an element might have 10, 15, or 20 required children (and each of those children may have several required elements and attributes as well), this can be extremely cumbersome. Be sure that when you turn on continuous validity checking, you're really ready to build your DOM trees in this manner.
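One way to keep the build-it-detached discipline manageable is to push it into small builder methods that return a finished subtree, so the single appendChild( ) onto the live tree always comes last. The sketch below uses only core DOM calls, so it runs without a validation-capable parser; SubtreeFirst, leaf( ), buildName( ), and the people root element are all made up for the example.

```java
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

// Hypothetical class for illustration.
class SubtreeFirst {

    /** Creates <tag>text</tag> without attaching it to anything. */
    static Element leaf(Document doc, String tag, String text) {
        Element element = doc.createElement(tag);
        element.appendChild(doc.createTextNode(text));
        return element;
    }

    /** Builds a complete, still-detached name subtree. */
    static Element buildName(Document doc, String first, String last) {
        Element name = doc.createElement("name");
        name.appendChild(leaf(doc, "firstName", first));
        name.appendChild(leaf(doc, "lastName", last));
        return name;                          // getParentNode() is still null here
    }

    static Document demo() {
        try {
            DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
            Document doc = registry.getDOMImplementation("XML")
                                   .createDocument(null, "people", null);
            // The only operation that touches the live tree
            doc.getDocumentElement().appendChild(buildName(doc, "Pete", "Huttlinger"));
            return doc;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create the DOM registry", e);
        }
    }

    public static void main(String[] args) {
        Document doc = demo();
        System.out.println(doc.getElementsByTagName("lastName").item(0).getTextContent());
    }
}
```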

Checking for valid operations

Most of the methods on these new interfaces are concerned with checking an operation for validity; in other words, you would use a method to see if an operation is valid before actually performing that operation. So, you might want to see if it's legal to set the value of a Text node to "Lady, by yonder blessed moon I vow,//That tips with silver all these fruit-tree tops":

// Get this text node, in whatever fashion your DOM tree provides
Text proseTextNode = getProseTextNode( );
String proseText = "Lady, by yonder blessed moon I vow,//" +
 "That tips with silver all these fruit-tree tops";
// See if entering this data is valid; the cast assumes a validation-capable parser
CharacterDataEditVAL editable = (CharacterDataEditVAL) proseTextNode;
if (editable.canSetData(proseText) == NodeEditVAL.VAL_TRUE)
 proseTextNode.setData(proseText);

This is pretty simple code to understand. After two chapters of DOM, you're probably used to having to use constant-based comparisons, rather than Boolean comparisons, so nothing is new there.

Constant-based comparisons, as well as somewhat silly class names (like NodeEditVAL and ElementEditVAL), are all results of DOM being cross-platform.

In the case of the canXXX( ) methods, though, constant-based comparison is necessary. In some cases, those methods return neither true nor false; it is possible for a DOM parser to return a third value, NodeEditVAL.VAL_UNKNOWN. The specification isn't clear about why this might happen, but I can see several cases where an unknown value is likely:

There is no schema (DTD, XSD, etc.) available to validate against.
Dependencies in the schema make validation of the data indefinite or ambiguous.

Also realize that sometimes a value of NodeEditVAL.VAL_FALSE indicates that it's not (just) this data that might be invalid, but other parts of the document. If you don't have the DOM tree constantly enforcing validity, then changes to this particular node can have ripple effects that invalidate other parts of the DOM tree. These get pretty tricky, and the specification isn't at all clear about how to handle these situations.
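Given three possible answers, it helps to decide up front what your code does with each. Here's a sketch of one conservative policy: apply on VAL_TRUE, reject on VAL_FALSE, and defer everything else to a whole-document check. Because the org.w3c.dom.validation package may not be on your classpath, the constants are declared locally; the numeric values are what I believe the specification's Java binding assigns, so treat them as an assumption and verify them there. ValResult and Decision are invented names.

```java
// Hypothetical class for illustration; not part of the DOM API.
class ValResult {

    // Assumed to mirror NodeEditVAL.VAL_TRUE, VAL_FALSE, and VAL_UNKNOWN
    static final short VAL_TRUE = 5;
    static final short VAL_FALSE = 6;
    static final short VAL_UNKNOWN = 7;

    enum Decision { APPLY, REJECT, DEFER }

    /** Maps a canXXX( ) result onto what the calling code should do next. */
    static Decision decide(short result) {
        switch (result) {
            case VAL_TRUE:
                return Decision.APPLY;
            case VAL_FALSE:
                return Decision.REJECT;
            default:
                // VAL_UNKNOWN, or any value this code doesn't recognize
                return Decision.DEFER;
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(VAL_TRUE));
        System.out.println(decide(VAL_FALSE));
        System.out.println(decide(VAL_UNKNOWN));
    }
}
```

Treating unrecognized values the same as VAL_UNKNOWN means a parser that returns something unexpected never causes a change to be applied by accident.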

Even more concerning, the validation specification isn't clear on what the parser's responsibility is when canXXX( ) is invoked. Is it just to validate the new data with respect to the current node? Is it to validate the entire document? And what happens in cases where the document as a whole is invalid, and no data passed to canXXX( ) would return VAL_TRUE? These are all serious (and very complex) issues; I suspect they're also why there aren't any validation-capable parsers yet.

These methods are all pretty self-explanatory; I'll leave it to you, the Java Language Bindings (http://www.w3.org/TR/2004/REC-DOM-Level-3-Val-20040127/java-binding.html), and Javadoc to decipher the rest. In most cases, you'll use canInsertBefore( ), canReplaceChild( ), and canAppendChild( ) (all on the Node interface) heavily in your apps. ElementEditVAL adds lots of similar methods for attributes (canSetAttribute( ), canRemoveAttribute( ), etc.), and of course CharacterDataEditVAL does the same for text (canAppendData( ), canDeleteData( ), etc.).

All of these methods, when appropriate, have namespace-capable counterparts. So you can invoke canSetAttribute( ), or in namespace-aware apps, canSetAttributeNS( ).

Checking for state validity

In addition to checking the validity of an operation before you undertake it, you can also check the validity of a DOM tree at a given moment. This is the in-memory validation that developers have been clamoring for for a while. The easiest way to check a document's validity is with the new validateDocument( ) method, available through the DocumentEditVAL interface:

// Get the document in some business-specific manner
DocumentEditVAL doc = (DocumentEditVAL) getDocument( );
if (doc.validateDocument( ) == NodeEditVAL.VAL_TRUE) {
 // Go ahead and serialize the document
} else {
 // Report errors and repeat the cycle
}

This is pretty basic, and incredibly useful. In fact, if the validation module offered this functionality alone, developers would be pretty happy, I imagine. It's also particularly useful for the app shown above: ensuring validity before serializing a DOM tree to persistent storage. What's more interesting (and, once again, adds more complexity) is that the specification defines a similar method on the NodeEditVAL interface:

public short nodeValidity(short valType);

You can supply four values to this method:

NodeEditVAL.VAL_SCHEMA: Perform what would be considered "normal" validation; check this node and all its children (elements, text, etc.) for validity.
NodeEditVAL.VAL_INCOMPLETE: This is similar to VAL_SCHEMA, but only validates the current node and its immediate children. The children of those children are ignored.

Remember that in DOM, Text nodes are nested within Element nodes. So VAL_INCOMPLETE would ensure that the child elements of the current element are valid, but would ignore the textual content of those child elements. If you're not careful, you can get really deceptive results here.

NodeEditVAL.VAL_WF: This simply checks to make sure the current node is well-formed.
NodeEditVAL.VAL_NS_WF: This checks to see if the current node is well-formed and follows namespace rules properly (calling VAL_NS_WF effectively calls VAL_WF as well).

It's in VAL_SCHEMA that things get nasty (as mentioned in the last section). Because constraint models like XML Schema allow for some pretty advanced dependencies, it's possible (and even probable if you're really using XSD to its fullest) that you're going to get VAL_UNKNOWN over and over when trying to validate individual nodes. It's simply a very difficult problem to deal with nodes out of the context of an entire document; if you're unsure about validity, I strongly urge you to simply use validateDocument( ), and not mess with these rather ambiguous node-specific approaches to validation.
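Until a parser actually implements validateDocument( ), you can get the same whole-document, in-memory check another way. To be clear, the sketch below does not use the DOM Validation module at all: it swaps in the JAXP javax.xml.validation API, which validates a DOMSource against an XML Schema. The InMemoryValidation class, the tiny schema, and the name/firstName/lastName structure are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.SAXException;

// Hypothetical class for illustration; a stand-in for validateDocument( ).
class InMemoryValidation {

    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='name'><xs:complexType><xs:sequence>"
      + "<xs:element name='firstName' type='xs:string'/>"
      + "<xs:element name='lastName' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Builds a name tree in memory, optionally leaving out the required lastName. */
    static Document buildName(boolean includeLastName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder().newDocument();
            // createElementNS gives nodes the local names schema validation relies on
            Element name = doc.createElementNS(null, "name");
            Element first = doc.createElementNS(null, "firstName");
            first.setTextContent("Pete");
            name.appendChild(first);
            if (includeLastName) {
                Element last = doc.createElementNS(null, "lastName");
                last.setTextContent("Huttlinger");
                name.appendChild(last);
            }
            doc.appendChild(name);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM builder available", e);
        }
    }

    /** Validates the whole in-memory tree; false means the schema rejected it. */
    static boolean isValid(Document doc) {
        Schema schema;
        try {
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                                  .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("The schema itself is broken", e);
        }
        try {
            schema.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false;                     // first validity error ends the check
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("complete tree valid:   " + isValid(buildName(true)));
        System.out.println("incomplete tree valid: " + isValid(buildName(false)));
    }
}
```

This gives you a yes-or-no answer for the entire tree, which is the same question validateDocument( ) is meant to answer, without any of the node-level ambiguity.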