Validating Documents

"Words, words, mere words, no matter from the heart."

William Shakespeare, Troilus and Cressida

In this section, we talk about DTDs and XML Schema, two ways to enforce rules in an XML document. A DTD is a simple grammar for an XML document, defining which tags may appear where, in what order, with what attributes, etc. XML Schema is the next generation of DTD. With XML Schema, you can describe the data content of the document as well as the structure. XML Schemas are written in terms of primitives, such as numbers, dates, and simple regular expressions, and also allow the user to define complex types in a grammar-like fashion. The word schema means a blueprint or plan for structure, so we'll refer to DTDs and XML Schema collectively as schema where either applies. The term XML Schema is also a bit overloaded because there are now several competing languages (syntaxes) for describing XML in addition to the official W3C XML Schema standard. DTDs, although much more limited in capability, are still currently more widely used. This may be partly due to the complexity involved in writing XML Schemas by hand. The W3C XML Schema standard is verbose and cumbersome, which may explain why several alternative syntaxes have sprung up. In Java 5.0, a new javax.xml.validation API was added to standardize XML validation in a pluggable way. Out of the box, Java 5.0 supports only DTDs and W3C XML Schema, but new schema languages can be added in the future.

Using Document Validation

XML's validation of documents is a key piece of what makes it useful as a data format. Using a schema is somewhat analogous to the way Java classes enforce type checking in the language. A schema defines document types. Documents conforming to a given schema are often referred to as instance documents of the schema. This type safety provides a layer of protection that eliminates having to write complex error-checking code. However, validation may not be necessary in every environment. For example, when the same tool generates XML and reads it back, validation should not be necessary in normal operation. It is invaluable, though, during development. Often, document validation is used during development and turned off in production environments.

DTDs

The DTD language is fairly simple. A DTD is primarily a set of special tags that define each element in the document and, for complex types, provide a list of the elements it may contain. The DTD <!ELEMENT> tag consists of the name of the tag and either a special keyword for the data type or a parenthesized list of elements.

 <!ELEMENT Name ( #PCDATA )>
 <!ELEMENT Document ( Head, Body )>


The special identifier #PCDATA (parsed character data) indicates a string. When a list is provided, the elements are expected to appear in that order. The list may contain sublists, and alternatives may be specified using a vertical bar (|) as an OR operator. Special notation can also be used to indicate how many of each item may appear; two examples of this notation are shown in Table 24-4.

Table 24-4. DTD notation defining occurrences

 Character    Meaning
 *            Zero or more occurrences
 ?            Zero or one occurrence


Attributes of an element are defined with the <!ATTLIST> tag. This tag enables the DTD to enforce rules about attributes. It accepts a list of identifiers and a default value:

 <!ATTLIST Animal animalClass (unknown | mammal | reptile) "unknown">


This ATTLIST says that the Animal element has an animalClass attribute that can have one of three values: unknown, mammal, or reptile. The default is unknown. We won't cover everything you can do with DTDs here, but the following example will guarantee that zooinventory.xml follows the format we've described. Place the following in a file called zooinventory.dtd (or grab this file from the DVD or web site for the tutorial):

 <!ELEMENT Inventory ( Animal* )>
 <!ELEMENT Animal (Name, Species, Habitat, (Food | FoodRecipe), Temperament)>
 <!ATTLIST Animal animalClass (unknown | mammal | reptile) "unknown">
 <!ELEMENT Name ( #PCDATA )>
 <!ELEMENT Species ( #PCDATA )>
 <!ELEMENT Habitat ( #PCDATA )>
 <!ELEMENT Food ( #PCDATA )>
 <!ELEMENT FoodRecipe ( Name, Ingredient+ )>
 <!ELEMENT Ingredient ( #PCDATA )>
 <!ELEMENT Temperament ( #PCDATA )>


The DTD says that an Inventory consists of any number of Animal elements. An Animal has a Name, Species, and Habitat tag followed by either a Food or FoodRecipe and then a Temperament. FoodRecipe's structure is defined further down: a Name followed by one or more Ingredient elements (the + suffix means one or more occurrences). Normally, to use a DTD, we associate it with the XML document by placing a DOCTYPE declaration in the XML document itself. When a validating parser encounters the DOCTYPE, it attempts to load the DTD and validate the document. (In Java 5.0, the new validation API can also be used to validate arbitrary XML, whether it be an in-memory DOM representation or a file stream, against a schema written in any supported schema language. We'll cover that API after we discuss XML Schema.) There are several forms the DOCTYPE can have, but the one we'll use is:

 <!DOCTYPE Inventory SYSTEM "zooinventory.dtd">


Both SAX and DOM parsers can automatically validate documents as they read them, provided that the documents contain a DOCTYPE declaration. However, you have to explicitly ask the parser factory to provide a parser that is capable of validation. To do this, just set the validating property of the parser factory to true before you ask it for an instance of the parser. For example:

 SAXParserFactory factory = SAXParserFactory.newInstance( );
 factory.setValidating( true );


This setValidating( ) method is an older, simplistic way to enable validation of documents that contain DTD references. As you can see, it is tied to the parser. The new validation package that we'll discuss later is independent of the parser and more flexible. You should not use the parser-validating method in combination with the new validation API unless you want to validate documents twice for some reason. Try inserting the setValidating( ) line in our model builder example after the factory is created. Abuse the zooinventory.xml file by adding or removing an element or attribute and see what happens when you run the example. You should get useful error messages from the parser indicating the problems and parsing should fail. To get more information about the validation, we can register an org.xml.sax.ErrorHandler object with the parser, but by default, Java installs one that simply prints the errors for us.
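As a rough sketch of what registering your own error handler looks like, the following stand-alone program turns on parser-based validation and collects the problems in a list rather than printing them. To stay self-contained, it carries a cut-down version of our DTD in the document's internal subset instead of referring to zooinventory.dtd. The class name DtdValidationSketch, the sample data, and the deliberately broken document are our own inventions for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdValidationSketch {

    // A cut-down inventory DTD placed in the internal subset (an assumption
    // made so that the example needs no external zooinventory.dtd file).
    static final String DOCTYPE =
        "<!DOCTYPE Inventory [\n"
        + "<!ELEMENT Inventory ( Animal* )>\n"
        + "<!ELEMENT Animal ( Name, Species )>\n"
        + "<!ATTLIST Animal animalClass (unknown | mammal | reptile) \"unknown\">\n"
        + "<!ELEMENT Name ( #PCDATA )>\n"
        + "<!ELEMENT Species ( #PCDATA )>\n"
        + "]>\n";

    // Made-up sample documents: GOOD follows the DTD; BAD uses an illegal
    // animalClass value and leaves out the Species element.
    static final String GOOD =
        "<Inventory><Animal animalClass=\"mammal\">"
        + "<Name>Cocoa</Name><Species>Gorilla</Species></Animal></Inventory>";
    static final String BAD =
        "<Inventory><Animal animalClass=\"fish\">"
        + "<Name>Cocoa</Name></Animal></Inventory>";

    // Parses the document with DTD validation turned on and returns the
    // warnings and validity errors that the parser reported.
    static List<String> validate(String body) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setValidating(true);
        DocumentBuilder builder = factory.newDocumentBuilder();

        final List<String> problems = new ArrayList<>();
        builder.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) {
                problems.add("warning: " + e.getMessage());
            }
            @Override public void error(SAXParseException e) {
                problems.add("error: " + e.getMessage());
            }
            @Override public void fatalError(SAXParseException e) throws SAXParseException {
                throw e; // the document is not well-formed, so give up
            }
        });
        builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        return problems;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("good: " + validate(GOOD));
        System.out.println("bad:  " + validate(BAD));
    }
}
```

Because the handler records validity errors instead of throwing, the parser keeps going and can report several problems in one pass. The exact wording of the messages depends on the parser implementation.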

XML Schema

Although DTDs can define the basic structure of an XML document, they don't provide a very rich vocabulary for describing the relationships between elements and say very little about their content. For example, there is no reasonable way with DTDs to specify that an element is to contain a numeric type or even to govern the length of string data. The XML Schema standard addresses both the structural and data content of an XML document. It is the next logical step, and either it or one of the competing schema languages with similar capabilities should replace DTDs in the future. XML Schema brings the equivalent of strong typing to XML by drawing on many predefined primitive element types and allowing the user to define new complex types of her own. These schemas even allow for types to be extended and used polymorphically, like types in the Java language. Although we can't cover XML Schema in any detail, here's the equivalent W3C XML Schema for our zooinventory.xml file:

 <?xml version="1.0" encoding="UTF-8"?>
 <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 
   <xs:element name="Inventory">
     <xs:complexType>
       <xs:sequence>
         <xs:element maxOccurs="unbounded" ref="Animal"/>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
 
   <xs:element name="Name" type="xs:string"/>
 
   <xs:element name="Animal">
     <xs:complexType>
       <xs:sequence>
         <xs:element ref="Name"/>
         <xs:element name="Species" type="xs:string"/>
         <xs:element name="Habitat" type="xs:string"/>
         <xs:choice>
           <xs:element name="Food" type="xs:string"/>
           <xs:element ref="FoodRecipe"/>
         </xs:choice>
         <xs:element name="Temperament" type="xs:string"/>
       </xs:sequence>
 
       <xs:attribute name="animalClass" default="unknown">
         <xs:simpleType>
           <xs:restriction base="xs:token">
             <xs:enumeration value="unknown"/>
             <xs:enumeration value="mammal"/>
             <xs:enumeration value="reptile"/>
           </xs:restriction>
         </xs:simpleType>
       </xs:attribute>
     </xs:complexType>
   </xs:element>
 
   <xs:element name="FoodRecipe">
     <xs:complexType>
       <xs:sequence>
         <xs:element ref="Name"/>
         <xs:element name="Ingredient" maxOccurs="unbounded" type="xs:string"/>
       </xs:sequence>
     </xs:complexType>
   </xs:element>
 
 </xs:schema>


This schema would normally be placed into an XML Schema Definition file, which has a .xsd extension. The first thing to note is that this schema file is a normal, well-formed XML file that uses elements from the W3C XML Schema namespace. In it we use nested element declarations to define the elements that will appear in our document. As with most languages, there is more than one way to accomplish this task. Here, we have broken out the "complex" Animal and FoodRecipe elements into their own separate element declarations and referred to them in their parent elements using the ref attribute. In this case, we did it mainly for readability; it would have been legal to have one big, deeply nested element declaration. However, referring to elements by reference in this way also allows us to reuse the same element declaration in multiple places in the document, if needed. Our Name element is a small example of this. Although it didn't do much for us here, we have broken out the Name element and referred to it for both the Animal/Name and the FoodRecipe/Name.

Control directives like sequence and choice allow us to define the structure of the child elements allowed, and attributes like minOccurs and maxOccurs let us specify cardinality (how many instances). The sequence directive says that the enclosed elements should appear in the specified order (if they are required). The choice directive allows us to specify alternative child elements like Food or FoodRecipe. We declared the legal values for our animalClass attribute using a restriction declaration and enumeration tags.

Simple types

Although we've not really exercised it here, the type attribute of our elements touches on the standardization of types in XML Schema. All of our "text" elements specify a type xs:string, which is a standard XML Schema string type (kind of like PCDATA in our DTD). There are many other standard types covering things such as dates, times, periods, numbers, and even URLs. These are called simple types (though some of them are not so simple) because they are standardized or "built-in." Table 24-5 lists W3C Schema simple types and their corresponding Java types. The correspondence will become useful later when we talk about JAXB and automated binding of XML to Java classes:

Table 24-5. W3C Schema simple types

 Schema element type   Java type                   Example
 xsd:string            java.lang.String            "This is text"
 xsd:boolean           boolean                     true, false, 1, 0
 xsd:byte              byte
 xsd:unsignedByte      short
 xsd:integer           java.math.BigInteger
 xsd:int               int
 xsd:unsignedInt       long
 xsd:long              long
 xsd:short             short
 xsd:unsignedShort     int
 xsd:decimal           java.math.BigDecimal
 xsd:float             float
 xsd:double            double
 xsd:QName             javax.xml.namespace.QName   funeral:corpse
 xsd:dateTime          java.util.Calendar          2004-12-27T15:39:05.000-06:00
 xsd:base64Binary      byte[]                      PGZv
 xsd:hexBinary         byte[]                      FFFF
 xsd:time              java.util.Calendar          15:39:05.000-06:00
 xsd:date              java.util.Calendar          2004-12-27
 xsd:anySimpleType     java.lang.String


For example, suppose we want to add a floating point Weight element like this to our Animal:

 <Weight>400.5</Weight>


We could now validate it in our schema by inserting the following entry at the appropriate place:

 <xs:element name="Weight" type="xs:double"/>


In addition to enforcing that the content of elements matches these simple types, XML Schema can give us much more control over the text and values of elements in our document using simple rules and regular expression-like patterns.
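To make this concrete, here is a minimal sketch that checks a Weight element typed as xs:double, along with a second element constrained by a pattern. It uses the javax.xml.validation API, which we describe in more detail under "The Validation API." The TagId element, its pattern, the class name SimpleTypeSketch, and the sample values are hypothetical and exist only for this illustration:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class SimpleTypeSketch {

    // A small schema: Weight must be a double, and the hypothetical TagId
    // must look like two capital letters, a dash, and four digits.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Animal'>"
        + " <xs:complexType><xs:sequence>"
        + "  <xs:element name='Weight' type='xs:double'/>"
        + "  <xs:element name='TagId'>"
        + "   <xs:simpleType><xs:restriction base='xs:string'>"
        + "    <xs:pattern value='[A-Z]{2}-[0-9]{4}'/>"
        + "   </xs:restriction></xs:simpleType>"
        + "  </xs:element>"
        + " </xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Builds an Animal document from the two values (sample data only).
    static String animal(String weight, String tagId) {
        return "<Animal><Weight>" + weight + "</Weight><TagId>" + tagId
            + "</TagId></Animal>";
    }

    // Returns true if the document validates against XSD. With no error
    // handler installed, the first validation error is thrown as a SAXException.
    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isValid(animal("400.5", "ZO-0042")));
        System.out.println(isValid(animal("heavy", "ZO-0042")));
        System.out.println(isValid(animal("400.5", "zoo42")));
    }
}
```

Only the first document should pass: the second fails the xs:double check and the third fails the pattern.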

Complex types

In addition to the predefined simple types listed in Table 24-5, we can define our own, complex types in our schema. Complex types are element types that have internal structure and possibly child elements. Our Inventory, Animal, and FoodRecipe elements all have complex types and their content must be declared with the complexType tag in our schema. Complex type definitions can be reused, similar to the way that element definitions can be reused in our schema. That is, we can break out a complex type definition and give it a name. We can then refer to that type by name in the type attributes of other elements. Since all of our complex types were only used once, in their corresponding elements, we didn't give them names. They were considered anonymous type definitions, declared and used in the same spot. For example, we could have separated our Animal's type from its element declaration:

 <xs:element name="Inventory">
   <xs:complexType>
     <xs:sequence>
       <xs:element name="Animal" maxOccurs="unbounded" type="AnimalType"/>
     </xs:sequence>
   </xs:complexType>
 </xs:element>
 
 <xs:complexType name="AnimalType">
   <xs:sequence>
     <xs:element ref="Name"/>
     <xs:element name="Species" type="xs:string"/>
     <xs:element name="Habitat" type="xs:string"/>
     ...


Declaring the AnimalType separately from the instance of the Animal element declaration would allow us to have other, differently named elements with the same internal structure. For example, our Inventory element may hold another element, MainAttraction, which is a type of Animal with a different tag name. The distinction between elements and their type definitions will also be important later when working with JAXB. There's a lot more to say about W3C XML Schema, and schemas can get quite a bit more complex than our simple example. However, you can do a lot with the few pieces we've shown above. Some tools are available to help you get started. We'll talk about one called Trang in a moment. For more information about XML Schema, see the W3C's site (http://www.w3.org/XML/Schema) or XML Schema by Eric van der Vlist (O'Reilly). In the next section, we'll show how to validate a file or DOM against the XML Schema we've just created, using the new validation API.
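Looking ahead to that API for a moment, the following sketch shows one named type shared by two differently named elements. The schema is a trimmed-down variation on ours, not the full inventory schema: AnimalType here holds only a Name and a Species, and the class name NamedTypeSketch and the sample animals are made up for the example:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class NamedTypeSketch {

    // MainAttraction and Animal are different tags with the same structure,
    // because both refer to the named complex type AnimalType.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Inventory'>"
        + " <xs:complexType><xs:sequence>"
        + "  <xs:element name='MainAttraction' type='AnimalType'/>"
        + "  <xs:element name='Animal' type='AnimalType'"
        + "              minOccurs='0' maxOccurs='unbounded'/>"
        + " </xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "<xs:complexType name='AnimalType'>"
        + " <xs:sequence>"
        + "  <xs:element name='Name' type='xs:string'/>"
        + "  <xs:element name='Species' type='xs:string'/>"
        + " </xs:sequence>"
        + "</xs:complexType>"
        + "</xs:schema>";

    // Sample documents (invented data). INCOMPLETE omits the Species that
    // AnimalType requires inside MainAttraction.
    static final String COMPLETE =
        "<Inventory>"
        + "<MainAttraction><Name>Cocoa</Name><Species>Gorilla</Species></MainAttraction>"
        + "<Animal><Name>Rex</Name><Species>Iguana</Species></Animal>"
        + "</Inventory>";
    static final String INCOMPLETE =
        "<Inventory>"
        + "<MainAttraction><Name>Cocoa</Name></MainAttraction>"
        + "</Inventory>";

    // Returns true if the document validates against XSD.
    static boolean isValid(String xml) throws Exception {
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        try {
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // thrown on the first error when no handler is set
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("complete:   " + isValid(COMPLETE));
        System.out.println("incomplete: " + isValid(INCOMPLETE));
    }
}
```

The same rules are enforced under both tag names, which is the practical payoff of naming the type instead of leaving it anonymous.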

Trang

Many tools can help you write XML Schema. One helpful tool is called Trang (http://www.thaiopensource.com/relaxng/trang.html). It is part of an alternative schema language project called RELAX NG (which we mention later in this chapter), but Trang is very useful in and of itself. It is an open source tool that can not only convert between DTDs and XML Schema but also create a rough DTD or XML Schema by reading "example" XML documents. This is a great way to sketch out a basic, example schema for your documents.

The Validation API

To use our example XML schema, we need to exercise the new javax.xml.validation API. As we said earlier, the validation API is an alternative to the simple, parser-based validation supported through the setValidating( ) method of the parser factories. To use the validation package, we create an instance of a SchemaFactory, specifying the schema language. We can then validate a DOM or stream source against the schema. The following example, Validate, is another simple command-line utility that you can use to test your XML and schemas. Just give it the XML filename and an XML Schema file (.xsd file) as arguments:

 import javax.xml.XMLConstants;
 import javax.xml.validation.*;
 import org.xml.sax.*;
 import javax.xml.transform.sax.SAXSource;
 import javax.xml.transform.stream.StreamSource;
 
 public class Validate
 {
     public static void main( String [] args ) throws Exception
     {
         if ( args.length != 2 ) {
             System.err.println("usage: Validate xmlfile.xml xsdfile.xsd");
             System.exit(1);
         }
         String xmlfile = args[0], xsdfile = args[1];
 
         // Compile the schema named on the command line
         SchemaFactory factory =
             SchemaFactory.newInstance( XMLConstants.W3C_XML_SCHEMA_NS_URI );
         Schema schema = factory.newSchema( new StreamSource( xsdfile ) );
         Validator validator = schema.newValidator( );
 
         // Print each problem as it is reported
         ErrorHandler errHandler = new ErrorHandler( ) {
             public void error( SAXParseException e ) { System.out.println(e); }
             public void fatalError( SAXParseException e ) { System.out.println(e); }
             public void warning( SAXParseException e ) { System.out.println(e); }
         };
         validator.setErrorHandler( errHandler );
 
         try {
             // Validate the XML file named on the command line
             validator.validate( new SAXSource( new InputSource( xmlfile ) ) );
         } catch ( SAXException e ) {
             // Validation was abandoned, for example after a fatal error
             System.out.println( "Validation stopped: " + e.getMessage( ) );
         }
     }
 }
  


The schema types supported initially are listed as constants in the XMLConstants class. Right now, only W3C XML Schema is implemented. A constant for DTDs is defined as well, so that support should follow, and there is also another intriguing type in there that we'll mention later. Our validation example follows the pattern we've seen before: we create a factory and then a Schema instance. The Schema represents the grammar and can create Validator instances that do the work of checking the document structure. Here, we've called the validate( ) method on a SAXSource, which comes from our file, but we could just as well have used a DOMSource to check an in-memory DOM representation:

 validator.validate( new DOMSource(document) );


Any errors encountered will cause the validate method to throw a SAXException, but this is just a coarse means of detecting errors. More generally, and as we've shown in this example, we'd want to register an ErrorHandler object with the validator. The error handler can be told about many errors in the document and convey more information. When the error handler is present, the exceptions are given to it and not thrown from the validate method. The errors generated by these parsers can be a bit cryptic. Hopefully, they will improve in the future. Keep in mind that these errors may not be able to give you line numbers because the validation is not necessarily being done against a stream.
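The difference between the two styles is easier to see in a small program. The following sketch validates an in-memory DOM through a DOMSource and uses an error handler that collects messages, so that more than one problem can be reported for a single document. The cut-down schema, the class name DomValidationSketch, and the sample data are assumptions made for this illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DomValidationSketch {

    // A cut-down inventory schema: each Animal has a Name and a numeric Weight.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='Inventory'>"
        + " <xs:complexType><xs:sequence>"
        + "  <xs:element name='Animal' maxOccurs='unbounded'>"
        + "   <xs:complexType><xs:sequence>"
        + "    <xs:element name='Name' type='xs:string'/>"
        + "    <xs:element name='Weight' type='xs:double'/>"
        + "   </xs:sequence></xs:complexType>"
        + "  </xs:element>"
        + " </xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    // Invented sample data; TWO_BAD_WEIGHTS has a non-numeric Weight in each Animal.
    static final String GOOD =
        "<Inventory><Animal><Name>Cocoa</Name><Weight>400.5</Weight></Animal></Inventory>";
    static final String TWO_BAD_WEIGHTS =
        "<Inventory>"
        + "<Animal><Name>Cocoa</Name><Weight>heavy</Weight></Animal>"
        + "<Animal><Name>Rex</Name><Weight>light</Weight></Animal>"
        + "</Inventory>";

    // Parses the XML into a DOM, validates the DOM, and returns the errors found.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // schema validation works on namespace-aware DOMs
        Document document =
            dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
            .newSchema(new StreamSource(new StringReader(XSD)));
        Validator validator = schema.newValidator();

        final List<String> errors = new ArrayList<>();
        validator.setErrorHandler(new ErrorHandler() {
            @Override public void warning(SAXParseException e) { /* ignored here */ }
            @Override public void error(SAXParseException e) {
                errors.add(e.getMessage()); // record the error and keep validating
            }
            @Override public void fatalError(SAXParseException e) throws SAXParseException {
                throw e;
            }
        });
        validator.validate(new DOMSource(document));
        return errors;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("good: " + validate(GOOD).size() + " error(s)");
        for (String message : validate(TWO_BAD_WEIGHTS)) {
            System.out.println("bad:  " + message);
        }
    }
}
```

How many messages you see for each bad value, and how they are worded, depends on the validator implementation, so the sketch makes no promise about the exact output.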

Alternative schema languages

In addition to DTDs and W3C XML Schema, several other popular schema languages are being used today. One interesting alternative that is tantalizingly referenced in the XMLConstants class is called RELAX NG. This schema language offers the most widely used features of XML Schema in a more human-readable format. In fact, it offers both a very compact, non-XML syntax and a regular XML-based syntax. RELAX NG doesn't offer the same text pattern and value validation that W3C XML Schema does. Instead, these aspects of validation are left to other tools (many people consider this to be "business logic," not properly done here anyway). If you are interested in exploring other schema languages, be sure to check out RELAX NG and its useful schema conversion utility, Trang.
