XML Schema
XML Schema seeks to improve upon DTDs by adding stronger typing and quite a few more constructs, as well as by using XML itself as the constraint representation format. I'm going to spend relatively little time here talking about schemas, because they are a behind-the-scenes detail for Java and XML. In the chapters where you'll be working with schemas, I'll address any specific points you need to be aware of. However, the specification for XML Schema is so enormous that explaining it would take up an entire tutorial on its own. As a matter of fact, XML Schema by Eric van der Vlist (O'Reilly) is just that: an entire tutorial on XML Schema.
XML Schema Definitions
Before getting into the actual schema constructs, take a look at a typical XML Schema root element:
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:dw="http://www.ibm.com/developerWorks/" elementFormDefault="unqualified" attributeFormDefault="unqualified" version="4.0">
There's quite a bit going on here, including two namespace declarations. First, the XML Schema namespace itself is bound to the xsd prefix, which separates XML Schema constructs from the elements and attributes being constrained. Next, the dw namespace is declared; this particular example is from the IBM DeveloperWorks XML article template, and dw is used for DeveloperWorks-specific constructs. Then, attributeFormDefault and elementFormDefault are both set to "unqualified". This means that locally declared elements and attributes do not have to be namespace-qualified in instance documents. Qualification is a fairly tricky idea, largely because unprefixed attributes in XML do not fall into the default namespace; they must be explicitly assigned to a namespace through a prefix. For a lot more on qualification, check out the relevant portion of the XML Schema specification at http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html#element-schema. Finally, the version attribute is given a value of "4.0". This indicates the version of this particular schema, not of the XML Schema specification being used. The namespace bound to the xsd prefix, http://www.w3.org/2001/XMLSchema, is what actually indicates which schema spec is in use; there is no explicit version attribute for that.
Elements and attributes
Elements are defined with the element construct, which names the element through its name attribute. You'll generally need to define your own data types by nesting a complexType tag within the element element. For example, here's an element definition from IBM's schema; this particular fragment constrains the code element:
<xsd:element name="code">
  <xsd:annotation>
    <xsd:documentation xml:lang="en">
      <title>Define a code listing</title>
      <desc>The stylesheet allows code to be inline or section. The contents of this element are displayed in a monospaced font, with all whitespace preserved from the original XML source.</desc>
    </xsd:documentation>
  </xsd:annotation>
  <xsd:complexType mixed="true">
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element ref="a"/>
      <xsd:element ref="b"/>
      <xsd:element ref="br"/>
      <xsd:element ref="font"/>
      <xsd:element ref="heading"/>
      <xsd:element ref="i"/>
      <xsd:element ref="sub"/>
      <xsd:element ref="sup"/>
      <xsd:group ref="specialCharacters"/>
    </xsd:choice>
    <xsd:attribute name="type" type="inline" use="required">
      <xsd:annotation>
        <xsd:documentation xml:lang="en">
          <desc>The type of code listing.</desc>
        </xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
    <xsd:attribute name="width">
      <xsd:annotation>
        <xsd:documentation xml:lang="en">
          <desc>The width in characters of this code listing.</desc>
        </xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
In this case, the element's name (code) is supplied, and then annotation is used to provide some basic commenting and documentation.
complexType simply informs the schema parser that the element is not a predefined schema type, like string or integer. Setting the mixed attribute to true lets the schema parser know that the code element can have textual content as well as nested elements. The default value for mixed is false; you have to say so explicitly when an element has both text and subelements. Next, choice is used to supply a selection of subelements. If you used sequence here instead, order would matter (elements would have to appear in the order in which they are declared in the schema). A choice, by contrast, allows any one of the listed elements, and the choice itself can repeat: minOccurs="0" and maxOccurs="unbounded" let it occur zero or more times. This effectively allows any number of any of these elements to appear, in any order. Each element referenced (using ref) must be defined somewhere else in the schema, and that definition may have its own complexType that references other elements. Finally, the type and width attributes are defined and annotated, using the attribute keyword. So, there should be two things to take away from this definition:
- Once you get the basic constructs in your head, it's fairly easy to read an XML Schema.
- Even the definition of very simple elements is verbose; you'll rarely see an XML Schema that's fewer than several hundred lines.
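To see the effect of mixed content and a repeating choice, here is a minimal sketch in Java that uses the JDK's javax.xml.validation API. The schema is a hypothetical, cut-down version of the code element (only b and i are allowed as children), and the class and method names are made up for illustration; none of this comes from IBM's actual schema.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class MixedChoiceDemo {
    // Hypothetical, simplified schema: "code" has mixed content and a
    // choice of b or i that may repeat zero or more times.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='b' type='xsd:string'/>"
      + "<xsd:element name='i' type='xsd:string'/>"
      + "<xsd:element name='code'>"
      + "<xsd:complexType mixed='true'>"
      + "<xsd:choice minOccurs='0' maxOccurs='unbounded'>"
      + "<xsd:element ref='b'/>"
      + "<xsd:element ref='i'/>"
      + "</xsd:choice>"
      + "</xsd:complexType>"
      + "</xsd:element>"
      + "</xsd:schema>";

    // Returns true if the document validates against XSD, false if the
    // validator raises a SAXException (a validity or well-formedness error).
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Text and children interleaved, in an order that differs from the schema.
        System.out.println(isValid("<code>if (<i>x</i>) <b>y</b> else <i>z</i></code>"));
        // The u element is not one of the choices, so this should be rejected.
        System.out.println(isValid("<code><u>oops</u></code>"));
    }
}
```

The first document mixes text with i and b children in an order that differs from the schema and should be accepted; the second uses an element outside the choice and should be rejected.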
Simple types
If you have a so-called "simple type," you can avoid the complexType construct altogether:
<xsd:element name="text-data" type="xsd:string" />
Extending base types
You'll often want the simplicity of a simple type but the flexibility of XML Schema's more advanced constraints. For example, if you were defining a colorname element, you would probably want it as a simple string:
<xsd:element name="colorname" type="xsd:string" />
But you can use XML Schema's enumeration feature to ensure that only certain colors are allowed. In these cases, you derive a new type from the base type; since you're actually restricting the base type of string, rather than expanding on it, you use the restriction keyword instead of extension:
<xsd:simpleType>
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="blue" />
    <xsd:enumeration value="green" />
    <xsd:enumeration value="red" />
  </xsd:restriction>
</xsd:simpleType>
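As a rough illustration of what the enumeration buys you, the following Java sketch validates two small documents against a restricted string type using the JDK's javax.xml.validation API. The type name colorType and the class name are assumptions made for this example; only the colorname element and the three color values come from the text.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class EnumerationDemo {
    // colorType is a hypothetical name for the restricted string type.
    static final String XSD =
        "<xsd:schema xmlns:xsd='http://www.w3.org/2001/XMLSchema'>"
      + "<xsd:element name='colorname' type='colorType'/>"
      + "<xsd:simpleType name='colorType'>"
      + "<xsd:restriction base='xsd:string'>"
      + "<xsd:enumeration value='blue'/>"
      + "<xsd:enumeration value='green'/>"
      + "<xsd:enumeration value='red'/>"
      + "</xsd:restriction>"
      + "</xsd:simpleType>"
      + "</xsd:schema>";

    // Returns true if the document validates, false on any SAXException.
    static boolean isValid(String xml) {
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "green" is one of the enumerated values.
        System.out.println(isValid("<colorname>green</colorname>"));
        // "purple" is a perfectly good string, but not in the enumeration.
        System.out.println(isValid("<colorname>purple</colorname>"));
    }
}
```

With a plain xsd:string type, both documents would validate; the restriction is what rules out the second one.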
On the other hand, extension is used when you're taking a base type and adding to it:
<xsd:element name="title">
  <xsd:complexType>
    <xsd:simpleContent>
      <xsd:extension base="xsd:string">
        <xsd:attribute name="" type="xsd:string"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>
</xsd:element>
Here, the title element is based on a simple string, but adds an attribute (also a string).
Although I've barely scratched the surface of XML Schema, this should at least give you a rough idea of the major constructs; it's certainly enough to get you through this tutorial without too much trouble.
Generating XML Schemas from Instance Documents
You already know about Relaxer from the previous section "Generating DTDs from XML Instance Documents." The same tool works with XML Schemas, using the -xsd option:
relaxer -xsd toc.xml
You'll get an XSD file (in this case, toc.xsd). For the Eclipse table of contents, the resulting schema is shown in Example 2-3.
Example 2-3. The XML Schema generated by Relaxer automatically assigns a no-URL namespace as the default, if none is specified in the instance document
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns="" xmlns:xsd="http://www.w3.org/2001/XMLSchema" targetNamespace="">
  <xsd:element type="toc"/>
  <xsd:complexType >
    <xsd:sequence>
      <xsd:element maxOccurs="unbounded" minOccurs="1" type="topic"/>
    </xsd:sequence>
    <xsd:attribute type="xsd:token"/>
  </xsd:complexType>
  <xsd:complexType >
    <xsd:sequence>
      <xsd:element type="link"/>
    </xsd:sequence>
    <xsd:attribute type="xsd:token"/>
    <xsd:attribute type="xsd:token"/>
  </xsd:complexType>
  <xsd:complexType >
    <xsd:sequence/>
    <xsd:attribute type="xsd:token"/>
  </xsd:complexType>
</xsd:schema>
Compare this to Example 2-2, and you begin to see how verbose XML Schema really is! As in the case of autogeneration of DTDs, the more instance documents you can supply to Relaxer, the more accurate the resulting XML Schema.
Generating XML Schemas from a DTD
As the XML community moves away from DTDs to either XML Schema or RELAX NG, you'll need to convert many of your DTDs to a new constraint model. The DTD2XS tool at http://www.lumrix.net/xmlfreeware.php is perfect for just this use case. Download the tool, expand it somewhere convenient (like /usr/local/java/dtd2xs), and add that directory to your classpath. On Unix/Linux/Mac OS X:
export CLASSPATH=$CLASSPATH:/usr/local/java/dtd2xs
and on Windows:
set CLASSPATH=%CLASSPATH%;c:\java\dtd2xs
Unfortunately, you have to copy the complextype.xsl file from the DTD2XS distribution into the directory you're working from (or always convert from the dtd2xs directory, which is equally inconvenient). Now just give the tool a DTD to convert:
[bmclaugh] java dtd2xsd toc.dtd > toc-schema.xsd
dtd2xs: dtdURI file:////Users/bmclaugh/Documents/Oracle/Writing/Java and XML 3rd/subs/code/ch02/toc.dtd
dtd2xs: resolveEntities true
dtd2xs: ignoreComments true
dtd2xs: commentLength 100
dtd2xs: commentLanguage null
dtd2xs: conceptHighlight 2
dtd2xs: conceptOccurrence 1
dtd2xs: conceptRelation element attribute
dtd2xs: load DTD ... done
dtd2xs: remove comments from DTD ... done
dtd2xs: DOM translation ... ... done
dtd2xs: complextype.xsl ... done
dtd2xs: add namespace ... done
The resulting XML Schema is shown in Example 2-4.
Example 2-4. The output from DTD2XS isn't the prettiest you'll ever see, but it usually gets the job done just fine
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element >
    <xs:complexType>
      <xs:sequence minOccurs="0">
        <xs:element ref="link"/>
      </xs:sequence>
      <xs:attribute type="xs:string"/>
      <xs:attribute type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element >
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="topic"/>
      </xs:sequence>
      <xs:attribute type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
  <xs:element >
    <xs:complexType>
      <xs:attribute type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
Validating XML Against an XML Schema
Finally, you should be able to validate your documents against an XML Schema (without resorting to programming, which is detailed in later chapters). As in "Validating XML Against a DTD," xmllint does the trick. First, though, you need to reference your schema in your instance document, and this works quite a bit differently from a DOCTYPE declaration.
Referencing a schema for nonnamespaced documents
If you're not using namespaces in the instance document, here's what you'd use:
<dw-document xsi:noNamespaceSchemaLocation="dw-document-4.0.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
dw-document is the root element here, and it defines the xsi namespace. You should always use the same URI for this declaration (http://www.w3.org/2001/XMLSchema-instance), as that's what schema-aware parsers are expecting.
Since there is no namespace being constrained, use the noNamespaceSchemaLocation attribute to indicate where to find the XML Schema; that schema constrains all portions of the document that are not in a namespace.
Referencing a schema for namespaced documents
If you are using namespaces, you'll need to pair each namespace with a schema to validate against:
<dw-document xmlns="http://www.ibm.com/developerWorks" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.ibm.com/developerWorks dw-document-4.0.xsd">
schemaLocation is used instead of noNamespaceSchemaLocation, and its value is a pair of entries separated by whitespace (that whitespace appears as a line break in the printed tutorial). The first entry is the namespace to constrain, and the second is the location of the schema for that namespace.
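The pairing inside a schemaLocation value can be sketched in a few lines of Java. This is only an illustration of how the whitespace-separated entries line up as namespace and location; it is not how any particular parser implements the attribute, and the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class SchemaLocationPairs {
    // Splits an xsi:schemaLocation value on whitespace and pairs each
    // namespace URI with the schema location that follows it.
    static Map<String, String> parse(String schemaLocation) {
        String[] tokens = schemaLocation.trim().split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException(
                "schemaLocation must contain namespace/location pairs");
        }
        Map<String, String> pairs = new LinkedHashMap<>();
        for (int i = 0; i < tokens.length; i += 2) {
            pairs.put(tokens[i], tokens[i + 1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // The value from the dw-document example; a line break between the
        // two entries would work just as well as a space.
        Map<String, String> pairs =
            parse("http://www.ibm.com/developerWorks dw-document-4.0.xsd");
        pairs.forEach((ns, loc) -> System.out.println(ns + " -> " + loc));
    }
}
```

Because the entries are separated by any whitespace, a document that uses several namespaces can list several namespace/location pairs in the same attribute value.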
Validating against a schema
Now invoke xmllint with the --schema option:
[bmclaugh] xmllint --schema dw-document-4.0.xsd index.xml --noout
index.xml validates
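If the document doesn't conform to the schema, xmllint instead reports each validation error along with the line it occurred on, which makes problems easy to track down and fix.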
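Programmatic validation is covered in later chapters, but as a preview, here is a hedged sketch of the same kind of error reporting using the JDK's javax.xml.validation API. The tiny toc schema (a single required label attribute) and the class and method names are assumptions for illustration only, and the exact wording of the messages depends on the underlying parser.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidationReport {
    // Hypothetical schema: a toc element with a required label attribute.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + "<xs:element name='toc'>"
      + "<xs:complexType>"
      + "<xs:attribute name='label' type='xs:string' use='required'/>"
      + "</xs:complexType>"
      + "</xs:element>"
      + "</xs:schema>";

    // Validates the document and returns one message per validity error,
    // each prefixed with the line number the parser reported.
    static List<String> problems(String xml) {
        List<String> found = new ArrayList<>();
        try {
            Validator validator = SchemaFactory
                .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)))
                .newValidator();
            validator.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) {
                    // Warnings are ignored in this sketch.
                }
                @Override public void error(SAXParseException e) {
                    // Validity errors are collected so validation can continue.
                    found.add("line " + e.getLineNumber() + ": " + e.getMessage());
                }
                @Override public void fatalError(SAXParseException e) throws SAXException {
                    // Well-formedness problems stop processing.
                    throw e;
                }
            });
            validator.validate(new StreamSource(new StringReader(xml)));
        } catch (SAXException | IOException e) {
            throw new IllegalStateException("Could not complete validation", e);
        }
        return found;
    }

    public static void main(String[] args) {
        // Valid document: no problems are collected.
        System.out.println(problems("<toc label='Guide'/>"));
        // Missing the required label attribute: at least one problem is listed.
        problems("<toc/>").forEach(System.out::println);
    }
}
```

Collecting errors in a handler, rather than stopping at the first exception, mirrors the way xmllint lists every problem it finds in a single run.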