XML Validation
Along with XPath support, JAXP 1.3 added an entirely new validation framework. Previously, validation was handled by invoking setValidating( ) on either a SAXParserFactory or a DocumentBuilderFactory:
factory.setValidating(true);
This approach, while functional, left a lot to be desired. It relied on the document being parsed to specify the schema to validate against, which can be problematic: it's common for documents to omit a DOCTYPE or schema reference, and yet you may still want to validate that document against a schema you have on hand. Additionally, setValidating( ) is ambiguous as to the constraint type being used. Is the document to be validated against a DTD? An XML Schema? Can RELAX NG schemas be used? What if the document references both a DTD and an XML Schema? These are all questions that prompted the creation of a new JAXP package, javax.xml.validation (shown in the accompanying figure).
Creating a SchemaFactory
This should already start to make some sense; classes like Schema and SchemaFactory look a lot like the SAXParser/SAXParserFactory and DocumentBuilder/DocumentBuilderFactory combinations from SAX and DOM. In fact, you begin, as you do with the other JAXP factory classes, by creating a new SchemaFactory via the newInstance( ) method:
SchemaFactory schemaFactory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Most of the classes in javax.xml.validation are related to internal processing; you'll usually use only SchemaFactory, Schema, and Validator.

JAXP hardwires each SchemaFactory instance to a particular type of schema, so you'll need to supply this method with a constant representing the schema variant you want to use. These come from another new JAXP class, javax.xml.XMLConstants; Table 7-2 shows the constants supported for use with validation.
Table 7-2. Constants supported for use with validation
Constant name | Schema language |
---|---|
XMLConstants.RELAXNG_NS_URI | RELAX NG |
XMLConstants.W3C_XML_SCHEMA_NS_URI | XML Schema |
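Whether a given constant actually yields a factory depends on the JAXP implementation you have installed. Here's a minimal sketch of how you might probe for support before committing to a schema language; the class name SchemaLanguageProbe and its isSupported( ) helper are my own inventions for illustration, not part of JAXP. It relies only on the documented behavior that newInstance( ) throws an IllegalArgumentException when no implementation handles the requested language.

```java
import javax.xml.XMLConstants;
import javax.xml.validation.SchemaFactory;

/** Hypothetical helper: reports which schema languages this JAXP install supports. */
class SchemaLanguageProbe {

    /** Returns true if a SchemaFactory can be created for the given schema language URI. */
    static boolean isSupported(String schemaLanguage) {
        try {
            SchemaFactory.newInstance(schemaLanguage);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown when no SchemaFactory implementation handles this language
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("XML Schema: "
            + isSupported(XMLConstants.W3C_XML_SCHEMA_NS_URI));
        System.out.println("RELAX NG:   "
            + isSupported(XMLConstants.RELAXNG_NS_URI));
    }
}
```

Don't be surprised if the RELAX NG line prints false; many parsers ship with XML Schema support only.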
Where's the DTD?
You'll notice a rather obvious absence: DTDs. If you read through the XML 1.0 specification, there's simply no room for external validation using DTDs. In other words, the specification is adamant that for validation against a DTD to occur, the document must specify that DTD within the document, using a DOCTYPE declaration. So, you're stuck with good old setValidating(true) for DTD validation. But, with the validation API, you can limit that approach to just DTD validation, making it a decent solution (and removing ambiguity from your code). For all other validation, use the JAXP validation API.
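To make the DTD case concrete, here's a small sketch of setValidating(true) restricted to DTD validation. The class name, the sample documents, and the collecting ErrorHandler are all assumptions of mine; the DTD lives in the DOCTYPE's internal subset so the example needs no external files.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

/** Hypothetical example: DTD validation driven by the document's own DOCTYPE. */
class DtdValidationSketch {
    static final String DOCTYPE =
        "<!DOCTYPE note [<!ELEMENT note (to)><!ELEMENT to (#PCDATA)>]>";
    static final String VALID = DOCTYPE + "<note><to>Ann</to></note>";
    static final String INVALID = DOCTYPE + "<note><from>Bob</from></note>";

    /** Parses the XML with DTD validation on; returns the problems reported. */
    static List<String> validateAgainstDoctype(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true); // DTD validation only

        final List<String> problems = new ArrayList<>();
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                problems.add("warning: " + e.getMessage());
            }
            public void error(SAXParseException e) {
                // Validity errors arrive here; record them and keep parsing
                problems.add("error: " + e.getMessage());
            }
            public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // not well-formed: give up
            }
        });
        builder.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid document:   " + validateAgainstDoctype(VALID));
        System.out.println("invalid document: " + validateAgainstDoctype(INVALID));
    }
}
```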
There are several options on SchemaFactory, although I found that only a few of them were very useful in the "typical" validation process. The most notable of these is setErrorHandler( ), which accepts an org.xml.sax.ErrorHandler implementation. Rather than dealing with errors in parsing an XML document, this handler reports errors in processing a schema loaded via the newSchema( ) method (the subject of the next section, "Representing a Constraint Model in Java"). This is a simple way to deal with schema-loading errors gracefully, via an interface you already should be comfortable with.
Another option you may want to investigate is accessed through the setResourceResolver( ) method. You can pass this method an object to handle resource resolution when parsing schemas; for example, if your schema references other schemas, or external entities, the class handed to this method can intercept resolution requests and redirect them. The downside to this method, and the main reason I tend to use it sparingly, is that the object you must supply to setResourceResolver( ) is an org.w3c.dom.ls.LSResourceResolver: a DOM 3 construct, used in the Load and Save module. Since many parsers are still coming up to speed on DOM Level 3, you may not have an implementation of this interface available to your parsing code.
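As a rough illustration of setErrorHandler( ) on the factory, the sketch below compiles a schema from a string and gathers whatever the handler is told. The class name, the two sample schemas, and the describe( ) helper are hypothetical. I've also wrapped newSchema( ) in a try/catch, on the assumption that some implementations throw a SAXException even after notifying the handler.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical example: collecting schema-loading problems via an ErrorHandler. */
class SchemaLoadErrors {
    static final String GOOD_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note' type='xs:string'/></xs:schema>";
    // Well-formed XML, but the element refers to a type that doesn't exist
    static final String BAD_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note' type='xs:noSuchType'/></xs:schema>";

    static String describe(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber() + ": " + e.getMessage();
    }

    /** Tries to compile the schema text; returns every problem reported. */
    static List<String> problemsLoading(String schemaText) {
        final List<String> problems = new ArrayList<>();
        SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        schemaFactory.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add(describe("warning", e)); }
            public void error(SAXParseException e) { problems.add(describe("error", e)); }
            public void fatalError(SAXParseException e) { problems.add(describe("fatal", e)); }
        });
        try {
            schemaFactory.newSchema(new StreamSource(new StringReader(schemaText)));
        } catch (SAXException e) {
            // Assumption: the factory may still throw after calling the handler
            problems.add("exception: " + e.getMessage());
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println("good schema: " + problemsLoading(GOOD_XSD));
        System.out.println("bad schema:  " + problemsLoading(BAD_XSD));
    }
}
```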
Representing a Constraint Model in Java
Once you've created a SchemaFactory, you can then create a new Schema object, via the newSchema( ) method on your factory. As you might expect, a Schema is a Java representation of a constraint model; therefore, pass the newSchema( ) method the schema it should represent (taking care that the constraint model is in the same schema language as that supported by the SchemaFactory):
SchemaFactory schemaFactory =
    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source schemaSource = new StreamSource(new File(args[1]));
Schema schema = schemaFactory.newSchema(schemaSource);
In this example, I've supplied the schema file in the form of a javax.xml.transform.Source implementation, something you should already be familiar with from TrAX (see Figures 7-9 and 7-10 for a refresher). You can also use a File or URL as input, or supply an array of Source implementations (Source[]); in that case, the SchemaFactory will combine the sources into a single Schema object, and return that object from newSchema( ). As you might expect, errors can abound in this situation, so be sure you have an ErrorHandler set to deal with problems that might arise when combining schemas.
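The Source[] form is easier to picture with an example. In this sketch, two tiny schemas with different target namespaces are combined into one Schema, which then accepts root elements from either namespace. Everything here (the class name, the urn:example namespaces, and the xsdFor( ) helper) is made up for illustration, and I've deliberately kept the schemas in separate namespaces, since combining several sources for the same namespace is where implementations tend to differ.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Source;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical example: one Schema built from several Source objects. */
class CombinedSchemas {
    static final String ORDER_XSD = xsdFor("urn:example:order", "order");
    static final String INVOICE_XSD = xsdFor("urn:example:invoice", "invoice");

    /** Builds a one-element schema for the given namespace (illustration only). */
    static String xsdFor(String namespace, String rootName) {
        return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
             + " targetNamespace='" + namespace + "'>"
             + "<xs:element name='" + rootName + "' type='xs:string'/>"
             + "</xs:schema>";
    }

    /** Combines all the schema texts into a single Schema object. */
    static Schema combine(String... schemaTexts) throws SAXException {
        Source[] sources = new Source[schemaTexts.length];
        for (int i = 0; i < schemaTexts.length; i++) {
            sources[i] = new StreamSource(new StringReader(schemaTexts[i]));
        }
        SchemaFactory schemaFactory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        return schemaFactory.newSchema(sources);
    }

    /** Parses the XML into a namespace-aware DOM and validates it. */
    static boolean isValid(Schema schema, String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        try {
            schema.newValidator().validate(new DOMSource(doc));
            return true;
        } catch (SAXException e) {
            return false; // with no ErrorHandler set, errors surface as exceptions
        }
    }

    public static void main(String[] args) throws Exception {
        Schema schema = combine(ORDER_XSD, INVOICE_XSD);
        System.out.println(isValid(schema, "<order xmlns='urn:example:order'>42</order>"));
        System.out.println(isValid(schema, "<invoice xmlns='urn:example:invoice'>7</invoice>"));
        System.out.println(isValid(schema, "<shipment xmlns='urn:example:order'/>"));
    }
}
```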
Validating XML
With a Schema object created, you just need to call newValidator( ) to get what you want: an object to validate your XML. The Validator class is shown in the accompanying figure.
Unsurprisingly, most of the Validator methods serve to validate an XML document.

Once you have a Validator, simply call validate( ):
// Create the validator
Validator validator = schema.newValidator( );

// Validate
validator.validate(new DOMSource(doc));
As with SchemaFactory, you can set an ErrorHandler and LSResourceResolver on your Validator to provide some more information from and control over the validation process. In fact, without an ErrorHandler to gracefully handle and report validation errors, this method will simply throw an exception if there are problems, possibly crashing your program! Even with a good try/catch block, there's no substitute for using an ErrorHandler for dealing with potential validation problems.
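Here's one way that advice might look in practice: a Validator whose ErrorHandler records each validity error instead of throwing, so a single pass reports everything wrong with the document. The class name, the inline schema, and the sample documents are assumptions for the sake of a self-contained example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

/** Hypothetical example: a Validator that collects errors rather than throwing. */
class CollectingValidation {
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='item'><xs:complexType>"
      + "<xs:sequence><xs:element name='name' type='xs:string'/></xs:sequence>"
      + "<xs:attribute name='qty' type='xs:positiveInteger' use='required'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Validates the XML against XSD and returns the problems reported. */
    static List<String> validate(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));

        // The DOM must be namespace-aware for schema validation to work
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        final List<String> problems = new ArrayList<>();
        Validator validator = schema.newValidator();
        validator.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { problems.add("warning: " + e.getMessage()); }
            public void error(SAXParseException e) { problems.add("error: " + e.getMessage()); }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        validator.validate(new DOMSource(doc));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(validate("<item qty='3'><name>bolt</name></item>"));
        for (String problem : validate("<item qty='0'><name>bolt</name><extra/></item>")) {
            System.out.println(problem);
        }
    }
}
```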
Fixing errors as they occur
In addition to the version of validate( ) that takes as input a Source, there is another version that takes both a Source and Result. The XML returned in the Result is described as possibly augmented XML; what that means in reality is that you're allowed to change the XML from the Source to produce a valid Result. For example, there may be some common errors in XML input documents that you are willing to silently "fix," perhaps changing a namespace URI or adding an attribute with a default value. All this is possible through your ErrorHandler implementation, which can actually change the input XML document. That turns out to be a rather clumsy approach to fixing errors, though, as ErrorHandler is really intended to report errors, not fix them. A much better approach is to use the ValidatorHandler class.
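The two-argument form of validate( ) is simple to call, even if what ends up in the Result varies. This sketch validates a DOMSource into a DOMResult and prints the unit attribute of the result's root element; the schema declares a default for that attribute, so an augmenting implementation would be expected to fill it in. The class name and schema are hypothetical, and whether the default actually appears is an assumption about your parser, not something JAXP guarantees.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical example: validate(Source, Result) with DOM on both sides. */
class AugmentedValidation {
    // The unit attribute is optional and carries a default value
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='item'><xs:complexType>"
      + "<xs:attribute name='unit' type='xs:string' default='each'/>"
      + "</xs:complexType></xs:element></xs:schema>";

    /** Validates the XML and returns the (possibly augmented) result document. */
    static Document validateToNewDocument(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));

        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document input = builder.parse(new InputSource(new StringReader(xml)));

        // Source and Result types match (DOM for both); the result starts empty
        Document output = builder.newDocument();
        schema.newValidator().validate(new DOMSource(input), new DOMResult(output));
        return output;
    }

    public static void main(String[] args) throws Exception {
        Document result = validateToNewDocument("<item/>");
        // Empty if the implementation did not add the defaulted attribute
        System.out.println("unit = '"
            + result.getDocumentElement().getAttribute("unit") + "'");
    }
}
```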
Dealing with Files Directly
In JAXP 1.3, validate( ) will not accept as input a StreamSource (or a StreamResult for output). Further, you cannot pass in (for example) a DOMSource and a SAXResult; the input and output types must match (DOM for both or SAX for both). If you want to pass in a file, or convert the input to a different type of output, use the identity transformation, detailed in "The Identity Transformation." This frees up Validator to handle validation, and only validation.
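As a quick reminder of what that identity transformation looks like, here's a sketch that turns any Source (a StreamSource wrapping a file, say) into a DOMSource you can hand to validate( ). The class and method names are mine. Later JAXP releases accept a StreamSource directly, so you may not need this step on a newer platform.

```java
import java.io.StringReader;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

/** Hypothetical helper: converts stream input to DOM via the identity transform. */
class IdentityToDom {

    /** Copies the input, unchanged, into a new DOM tree. */
    static DOMSource toDomSource(Source input) throws TransformerException {
        // A Transformer created without a stylesheet performs the identity transform
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        identity.transform(input, result);
        return new DOMSource(result.getNode());
    }

    public static void main(String[] args) throws Exception {
        Source stream = new StreamSource(new StringReader("<note><to>Ann</to></note>"));
        Document doc = (Document) toDomSource(stream).getNode();
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

In real code you would build the StreamSource from a File rather than a string; the string just keeps the sketch self-contained.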
Using ValidatorHandler to customize validation processing
If you are interested in really getting to the events that underlie validation processing, you should check out the ValidatorHandler class, which implements org.xml.sax.ContentHandler (this class is shown in the accompanying figure).
By implementing ContentHandler, ValidatorHandler allows fine-grained access to the SAX event chain as parsing and validation occur.

ValidatorHandler, by way of ContentHandler, has access to all the SAX events that occur in parsing, and allows you to (almost) directly deal with XML as it is being processed. You can easily add attributes, change values (perhaps to a new default), and even work with namespace prefixes and URIs.
You get access to a ValidatorHandler via the Schema object's newValidatorHandler( ) method:
ValidatorHandler vHandler = schema.newValidatorHandler( );
You can then provide a resource resolver (functioning much like an EntityResolver), and of course an ErrorHandler. What you won't find is a validate( ) method, though. When you create a ValidatorHandler, it maintains an association to the Schema it was created by. So, to ensure the handler is invoked, you need to associate the Schema with a SAXParserFactory or DocumentBuilderFactory; both provide a setSchema( ) method for just this purpose:
// Load up the document
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance( );
// XML Schema validation requires a namespace-aware parser
factory.setNamespaceAware(true);

// Set up an XML Schema validator, using the supplied schema
Source schemaSource = new StreamSource(new File(args[1]));
SchemaFactory schemaFactory =
    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(schemaSource);

// Instead of explicitly validating, assign the Schema to the factory
factory.setSchema(schema);

// Parsers from this factory will automatically validate against the
// associated schema
DocumentBuilder builder = factory.newDocumentBuilder( );
Document doc = builder.parse(new File(args[0]));
This may seem a bit odd; you create the ValidatorHandler, and never invoke it directly, or even assign it back to the Schema (that last step, assignment to the Schema, happens implicitly when you create the ValidatorHandler). Then, you assign the Schema to a SAX or DOM factory, and never invoke the Schema directly. But, if you get in a SAX frame of mind, this all makes a lot more sense; you're letting the factories create objects that will call into your Schema (and your ValidatorHandler) as they are needed; it's then that your code has an effect.
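If you'd rather drive a ValidatorHandler yourself than leave everything to the factory, its ContentHandler heritage suggests another wiring: register it as the ContentHandler of a namespace-aware XMLReader, and give it a downstream ContentHandler of your own to receive the events it passes along. The sketch below takes that approach; the class name, inline schema, and counting handlers are all assumptions for illustration.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.ValidatorHandler;
import org.xml.sax.Attributes;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical example: a ValidatorHandler placed directly in a SAX pipeline. */
class ValidatorHandlerPipeline {
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='note'><xs:complexType><xs:sequence>"
      + "<xs:element name='to' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Returns { elements seen downstream, validation errors reported }. */
    static int[] run(String xml) throws Exception {
        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        ValidatorHandler vHandler = schema.newValidatorHandler();

        final int[] counts = new int[2];
        vHandler.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { /* ignored in this sketch */ }
            public void error(SAXParseException e) { counts[1]++; }
            public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
        });
        // Events that pass through the validator are forwarded to this handler
        vHandler.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                counts[0]++;
            }
        });

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setContentHandler(vHandler); // parser -> validator -> your handler
        reader.parse(new InputSource(new StringReader(xml)));
        return counts;
    }

    public static void main(String[] args) throws Exception {
        int[] ok = run("<note><to>Ann</to></note>");
        System.out.println("elements=" + ok[0] + " errors=" + ok[1]);
        int[] bad = run("<note><from>Bob</from></note>");
        System.out.println("elements=" + bad[0] + " errors=" + bad[1]);
    }
}
```

The downstream handler is where you would add attributes or adjust values as events flow by; the counter here only stands in for that logic.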
A Big Fat Caveat About JAXP Validation
And, the bad news: JAXP validation is, at least as of this writing, buggy and quirky. Validation is new to JAXP (released in JAXP 1.3 and bundled with Java 5.0), and parsers are still adding support (and working out the bugs in that support). Remember that JAXP is largely a set of interfaces that parser vendors have to implement; Sun has the easier part of that deal (they spec out the API, provide a not-for-production reference implementation, and leave the production code to parser vendors). In any case, my tests with the latest versions of Xerces (2.6.3 and 2.7.1) were a bit flaky. Using Validator and validate( ) directly worked fine; assigning a Schema to a SAXParserFactory or DocumentBuilderFactory and initiating parsing sometimes worked well and sometimes bombed. I suspect that many of these problems will be worked out by the time you read this; just as DOM Level 3 defines validation APIs (and recognizes validation as a difficult problem), JAXP defines some pretty cool, albeit complex, interfaces for validation. Try the code samples and your own programs out, and see how they do; just be patient if everything doesn't work right away.