A DTD defines how data is formatted. It must define each allowed element in an XML document, the allowed attributes, and, when appropriate, the acceptable attribute values for each element; it also indicates the nesting and occurrences of each element, and any external entities. DTDs can specify many other things about an XML document, but these basics are what I'll focus on here.

Java Tip This chapter is by no means an extensive treatment of DTDs, XML Schema, or RELAX NG. For more detail on all of these schema types, check out XML in a Nutshell by Elliotte Rusty Harold and W. Scott Means (O'Reilly), and RELAX NG by Eric van der Vlist (O'Reilly), both exhaustive works on XML and RELAX NG.

DTD Semantics

There's remarkably little to a DTD's semantics, although you will have to use a totally different syntax for notation than you do in XML (an annoyance corrected in both XML Schema and RELAX NG).

Elements

The bulk of the DTD is composed of ELEMENT definitions (covered in this section) and attribute (ATTLIST) definitions (covered in the next section). An element definition begins with the ELEMENT keyword, following the standard <! opening of a DTD tag, and then the name of the element. Following that name is the content model of the element. The content model is generally within parentheses and specifies what content can be included within the element. Take the item element, from the RSS 0.91 DTD, as an example:

<!ELEMENT item (title | link | description)*>

This says that for any item element, there may be a title element, a link element, or a description element nested within that item. The "or" relationship is indicated by the pipe (|) symbol; the OR applies to all elements within a group, indicated by the parentheses. In other words, for the grouping (title | link | description), one and only one of title, link, or description may appear. The asterisk after the grouping indicates a recurrence. Table 2-1 lists the complete set of DTD recurrence modifiers.

Table 2-1. When an element needs to appear multiple times, recurrence operators must be used

Operator    Description
[Default]   Must appear once and only once (1)
?           May appear once or not at all (0..1)
+           Must appear at least once, up to an infinite number of times (1..N)
*           May appear any number of times, including not at all (0..N)

So, revisiting the item definition, for an item, a title, link, or description may appear; but then the group can appear multiple times, meaning that you can have any number of title, link, and/or description elements within an item, and the XML would still be valid.
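To make that concrete, here is a minimal sketch of an instance document that should validate against the item declaration. The internal DTD subset repeats the declaration from above and declares the three children as #PCDATA (as the RSS 0.91 DTD does); the element text is made up for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE item [
  <!ELEMENT item (title | link | description)*>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT link (#PCDATA)>
  <!ELEMENT description (#PCDATA)>
]>
<!-- Valid: children repeat, appear in any order, and link is absent -->
<item>
  <description>First description (sample text)</description>
  <title>Sample title</title>
  <title>A second title is also allowed</title>
</item>
```

Because the asterisk permits zero occurrences of the group, an empty item element would be valid here, too.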

Java Warning This can get tricky pretty quickly if you're not used to working with these operators. All the item definition really does is say that it can have any number of title, link, and description elements, in any order. Since there is no modifier that tells DTDs that order doesn't matter, this basically becomes a hack to get around that limitation.

As an example of recurrence applying to just one element, take a look at the skipHours element definition:

<!ELEMENT skipHours (hour+)>

Here, the skipHours element must have at least one hour element within it, but has no maximum number of occurrences. You can specify ordering using the comma (,) operator:

<!ELEMENT subscribers (url, email, comment*)>

In the subscribers element, there must be one (and only one) url element, followed by a single email element, followed by zero or more comment elements. If an element has character data within it, the #PCDATA keyword is used as its content model:

<!ELEMENT title (#PCDATA)>

If an element should always be an empty element, the EMPTY keyword is used:

<!ELEMENT topic EMPTY>

Java Tip You won't find the topic or subscribers elements in the RSS 0.91 DTD. There are no elements with ordered content in that DTD, nor are there empty elements, so I made these elements up.
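Here is a small sketch of a document that follows the ordered subscribers content model. The #PCDATA declarations for email and comment are my own assumptions (these elements are invented, after all), and the URL and address are placeholders.

```xml
<?xml version="1.0"?>
<!DOCTYPE subscribers [
  <!ELEMENT subscribers (url, email, comment*)>
  <!ELEMENT url (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT comment (#PCDATA)>
]>
<!-- Valid: exactly one url, then exactly one email, then any number of comments -->
<subscribers>
  <url>http://www.example.com/feed</url>
  <email>reader@example.com</email>
  <comment>First comment</comment>
  <comment>Second comment</comment>
</subscribers>
```

Swapping url and email, or leaving either one out, would make this document invalid, since the comma operator fixes both the order and the presence of each child. An element declared EMPTY, like topic, would simply be written as <topic/>.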

Attributes

Once you've handled the element definition, you'll want to define attributes. These are defined through the ATTLIST keyword. The first value is the name of the element, and then you have various attributes defined. Those definitions involve giving the name of the attribute, the type of attribute, and then whether the attribute is required or implied (which means it is not required, essentially). Most attributes with textual values will simply be of the type CDATA, as shown here:

<!ATTLIST rss
 version CDATA #REQUIRED> <!-- must be "0.91"> -->

You can also specify a set of values that an attribute must take on for the document to be considered valid (this sample is from a hypothetical DTD describing a technical tutorial):

<!ATTLIST title
 series (C | Java | Linux | Oracle | Perl | Web | Windows) #REQUIRED
>
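As a quick sketch of how these attribute definitions play out in an instance document, the following example reuses the series attribute from above. The edition attribute is hypothetical; I've added it only to show an #IMPLIED (optional) attribute alongside a #REQUIRED one.

```xml
<?xml version="1.0"?>
<!DOCTYPE title [
  <!ELEMENT title (#PCDATA)>
  <!ATTLIST title
    series (C | Java | Linux | Oracle | Perl | Web | Windows) #REQUIRED
    edition CDATA #IMPLIED
  >
]>
<!-- Valid: series is one of the enumerated values; the optional edition is omitted -->
<title series="Java">Sample Tutorial</title>
```

Leaving off the series attribute, or using a value outside the list (including series="java", as the match is case-sensitive), would cause validation to fail.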

Entities

You can specify entity reference resolution in a DTD using the ENTITY keyword. This works a lot like a DOCTYPE reference, where a public ID and/or system ID may be specified:

<!ENTITY oracleCopyright SYSTEM
 "http://www.oracle.com/copyright.xml"
>

This results in the copyright.xml file at the specified URL being loaded as the value of the oracleCopyright entity reference in a sample document that uses this reference:

<legal>
 <legal-notice>&oracleCopyright;</legal-notice>
</legal>
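To see how the declaration and the reference fit together, here is a sketch that places the entity declaration in a document's internal DTD subset. Putting it there (rather than in an external DTD file) is my assumption for the sake of a self-contained example. No ELEMENT declarations are included, so treat this as something to check for well-formedness rather than to validate.

```xml
<?xml version="1.0"?>
<!DOCTYPE legal [
  <!-- External parsed entity: the system ID tells the parser where the content lives -->
  <!ENTITY oracleCopyright SYSTEM
    "http://www.oracle.com/copyright.xml">
]>
<legal>
  <legal-notice>&oracleCopyright;</legal-notice>
</legal>
```

Keep in mind that a nonvalidating parser is allowed to skip loading external entities, so whether the content of copyright.xml actually shows up can depend on how your parser is configured.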

The following example shows the complete RSS 0.91 DTD, so you can see several of these constructs in action.

Example The RSS 0.91 DTD is a pretty simple example of a DTD; but then, DTDs are best for simple apps

<!--
Rich Site Summary (RSS) 0.91 official DTD, proposed.
RSS is an XML vocabulary for describing metadata about websites, and enabling the display of
"channels" on the "My Netscape" website.
RSS Info can be found at http://my.netscape.com/publish/
XML Info can be found at http://www.w3.org/XML/
copyright Netscape Communications, 1999
Dan Libby - danda@bugmenot.com
Based on RSS DTD originally created by Lars Marius Garshol - larsga@ifi.uio.no.
$Id$
-->
<!ELEMENT rss (channel)>
<!ATTLIST rss version CDATA #REQUIRED> <!-- must be "0.91"> -->
<!ELEMENT channel (title | description | link | language | item+ | rating? | image? | textinput? | copyright? | pubDate? | lastBuildDate? | docs? | managingEditor? | webMaster? | skipHours? | skipDays?)*>
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT link (#PCDATA)>
<!ELEMENT image (title | url | link | width? | height? | description?)*>
<!ELEMENT url (#PCDATA)>
<!ELEMENT item (title | link | description)*>
<!ELEMENT textinput (title | description | name | link)*>
<!ELEMENT name (#PCDATA)>
<!ELEMENT rating (#PCDATA)>
<!ELEMENT language (#PCDATA)>
<!ELEMENT width (#PCDATA)>
<!ELEMENT height (#PCDATA)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT pubDate (#PCDATA)>
<!ELEMENT lastBuildDate (#PCDATA)>
<!ELEMENT docs (#PCDATA)>
<!ELEMENT managingEditor (#PCDATA)>
<!ELEMENT webMaster (#PCDATA)>
<!ELEMENT hour (#PCDATA)>
<!ELEMENT day (#PCDATA)>
<!ELEMENT skipHours (hour+)>
<!ELEMENT skipDays (day+)>

Java Tip I've omitted the ISO Latin-1 character entities for clarity.

Generating DTDs from XML Instance Documents

If you need to quickly get a DTD up and running, and already have XML on hand, you may just want to autogenerate a DTD from an XML document. Relaxer is a cool tool for doing just this (as well as generating XML Schemas and RELAX NG schemas). Download Relaxer, and drop the archived folder somewhere accessible. Install it like this:

[bmclaugh:/usr/local/java/relaxer-1.0] sudo java -cp . setup
We trust you have received the usual lecture from the local System
Administrator. It usually boils down to these three things:
 #1) Respect the privacy of others.
 #2) Think before you type.
 #3) With great power comes great responsibility.
Password:
Install directory [default: /usr/local/lib/relaxer]:
Command directory [default: /usr/local/bin]:
[Configuration]
 Install directory = /usr/local/lib/relaxer
 Command directory = /usr/local/bin
Type "yes" to install, "no" to re-enter, "exit" to exit
> yes
Extract archives...
Generate script...
 script = /usr/local/bin/relaxer
Done.
[bmclaugh:/usr/local/java/relaxer-1.0]

Java Tip On Unix or Mac OS X, you'll probably need to use sudo to install Relaxer; /usr/local/lib (if it even exists on your system) is probably only root-writable. You may also need to make the resulting installed files accessible by non-root users:
sudo chmod -R 755 /usr/local/bin

Now you can run Relaxer:

[bmclaugh:/usr/local/java/relaxer-1.0] relaxer
Copyright(c) 2000-2003 ASAMI,Tomoharu. All rights reserved.
Relaxer Version 1.0 (20031224) by asami@relaxer.org
Usage: relaxer [-options] [args...]
 for more information, use -help option

To generate a DTD, just give it the name of your XML file, and specify the -dtd option:

relaxer -dtd toc.xml

Java Tip You can specify an alternate output directory using the -dir: [output dir] option. You should also ensure that the input XML has no DOCTYPE reference referring to an existing DTD, as that generally causes Relaxer to error out. Just comment out the reference if you have one already.

By default, Relaxer uses the name of your input XML as the name of the DTD, and replaces the XML extension with .dtd. The following example is the output from this command, using the table of contents from Eclipse's documentation set as input (toc.xml is available from the online examples).

Example The DTD generated here by Relaxer is simple; Relaxer works best with simple XML, and gets progressively worse at generation as your XML gets more complicated

<!-- Generated by Relaxer 1.0 -->
<!-- Wed Jul 06 13:39:26 CDT 2005 -->
<!ELEMENT topic (link)>
<!ATTLIST topic href CDATA #IMPLIED>
<!ATTLIST topic label CDATA #REQUIRED>
<!ELEMENT toc (topic+)>
<!ATTLIST toc label CDATA #REQUIRED>
<!ELEMENT link EMPTY>
<!ATTLIST link toc CDATA #REQUIRED>

While you'll often need to tweak the generated DTD to match your needs, it's a great start, and can save a lot of tedious DTD authoring. If you do get errors, they'll most likely crop up in recurrence operators. To help avoid these sorts of errors, you can supply multiple instance documents in an effort to get an even better first cut:

relaxer -dtd toc.xml toc_gr.xml toc_jp.xml

Often, a few extra files will really help nail down any optional elements, as well as refine the recurrence operators that Relaxer uses. In the case of the DTD generated above, this line is incorrect:

<!ELEMENT topic (link)>

You need to add an optional modifier, as link is not always required to be present in topic elements:

<!ELEMENT topic (link)?>
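Here is a sketch of an instance document that exercises the fix. The internal DTD subset copies the generated declarations with the corrected topic definition; the labels and file names are made up, and this is not the actual Eclipse toc.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE toc [
  <!ELEMENT toc (topic+)>
  <!ATTLIST toc label CDATA #REQUIRED>
  <!ELEMENT topic (link)?>
  <!ATTLIST topic href CDATA #IMPLIED>
  <!ATTLIST topic label CDATA #REQUIRED>
  <!ELEMENT link EMPTY>
  <!ATTLIST link toc CDATA #REQUIRED>
]>
<toc label="Sample Guide">
  <!-- This topic has a link child, which the generated DTD already allowed -->
  <topic label="Getting Started" href="start.html">
    <link toc="start-toc.xml"/>
  </topic>
  <!-- This topic has no link child; it is valid only with the (link)? fix -->
  <topic label="Reference"/>
</toc>
```

With the original (link) content model, the second topic would be reported as invalid, which is exactly the sort of error that extra instance documents can help Relaxer avoid.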

Aside from these glitches, I still use Relaxer on a regular basis.

Validating XML Against a DTD

If you just can't wait until I talk about SAX, DOM, and JAXP for validation, a few nifty tools make it simple to validate your XML documents against a DTD. I like using xmllint, an app that comes with Red Hat but can be downloaded for a variety of platforms. Installation varies by platform, but you'll find the program simple and well-documented; just make sure you have the xmllint binary somewhere on your path. To validate against a DTD, first make sure your XML has a DOCTYPE declaration in it:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE toc SYSTEM "toc.dtd">

Then, execute xmllint with the --valid option:

xmllint --valid toc.xml --noout

Java Tip xmllint will actually echo the XML you supply it, unless you supply the --noout option. With this option on, you'll only receive errors from validation. Further, in some cases, xmllint errors out when --noout is anywhere other than at the end of the command; always place it last to avoid these problems.


xmllint is a great tool for validating documents against generated DTDs; you'll often uncover errors, and be able to quickly correct them.