XML Basics

The basic syntax of XML is extremely simple. If you've worked with HTML, you're already halfway there. As with HTML, XML represents information as text using tags to add structure. A tag begins with a name sandwiched between less-than (<) and greater-than (>) characters. Unlike HTML, XML tags must always be balanced; in other words, an opening tag must always be matched by a closing tag. A closing tag looks just like the opening tag but starts with a less-than sign and a slash (</). An opening tag, closing tag, and any content in between are collectively referred to as an element of the XML document. Elements can contain other elements, but they must be properly nested (all tags started within an element must be closed before the element itself is closed). Elements can also contain plain text or a mixture of elements and text (called mixed content). Comments are enclosed between <!-- and --> markers. Here are a few examples:

 <!-- Simple -->
 <Sentence>This is text.</Sentence>
 <!-- Element -->
 <Paragraph><Sentence>This is text.</Sentence></Paragraph>
 <!-- Mixed -->
 <Paragraph>
 <Sentence>This <verb>is</verb> text.</Sentence>
 </Paragraph>
 <!-- Empty -->
 <PageBreak></PageBreak>

An empty tag can be written more compactly with a single tag ending with a slash and a greater-than sign (/>):

 <PageBreak/>
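
To see these rules enforced, we can hand a few snippets to a parser. The following is a minimal sketch, assuming the standard JAXP DocumentBuilder that ships with Java; the class name WellFormedCheck and the method isWellFormed() are our own inventions for illustration, not part of any API.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
class WellFormedCheck {

    // Returns true if the text parses as one properly nested element tree.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // unbalanced or badly nested tags end up here
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Mixed content, properly nested.
        System.out.println(isWellFormed(
            "<Paragraph><Sentence>This <verb>is</verb> text.</Sentence></Paragraph>"));
        // The compact form of an empty element.
        System.out.println(isWellFormed("<PageBreak/>"));
        // Closing tags in the wrong order: not properly nested.
        System.out.println(isWellFormed(
            "<Paragraph><Sentence>text</Paragraph></Sentence>"));
    }
}
```

The first two calls should report true and the third false, because the Sentence element is still open when the Paragraph closing tag arrives.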

Attributes

An XML element can contain attributes, which are simple name-value pairs supplied inside the start tag.

 <Document type="LEGAL" ID="42">...</Document>
 <Image name="truffle.jpg"/>

The attribute value must always be enclosed in quotes. You can use double (") or single (') quotes. Single quotes are useful if the value contains double quotes. Attributes are intended to be used for simple, unstructured properties or compact identifiers associated with the element data. It is always possible to make an attribute into a child element so, strictly speaking, there is no real need for attributes. But they often make the XML easier to read and more logical. In the case of the Document element in our snippet above, the attributes type and ID represent metadata about the document. We might expect that a Java class representing the Document would have an enumeration of document types such as LEGAL. In the case of the Image element, the attribute is simply a more compact way of including the filename. As a rule, attributes should be compact, with little significant internal structure (URLs push the envelope); by contrast, child elements can have arbitrary complexity.
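
From Java, attributes are read by name from the element that carries them. The sketch below assumes the standard DOM API (Element.getAttribute()); the class AttributeDemo, the method rootAttribute(), and the Note element are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class AttributeDemo {

    // Parses the text and returns the named attribute of its root element.
    // DOM returns an empty string when the attribute is absent.
    static String rootAttribute(String xml, String name) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return root.getAttribute(name);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootAttribute("<Image name=\"truffle.jpg\"/>", "name"));
        // Single quotes let the value contain double quotes.
        System.out.println(rootAttribute("<Note text='Say \"hi\"'/>", "text"));
    }
}
```

Note that the quotes are delimiters only; the value handed back to the program does not include them.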

XML Documents

An XML document begins with a header like the following and one root element:

 <?xml version="1.0" encoding="UTF-8"?>
 <MyDocument>
 </MyDocument>

The header identifies the version of XML and the character encoding used. The root element is simply the top of the element hierarchy, which can be considered a tree. If you omit this header or have XML text without a single root element (as in our earlier simple examples), technically what you have is called an XML fragment.

Encoding

The default encoding for an XML document is UTF-8, the ASCII-friendly 8-bit Unicode encoding. This encoding preserves ASCII values, so English text is unaltered by it. It also allows Unicode values to be stored in a reasonably efficient way. An XML document may specify another encoding using the encoding attribute of the XML header. Within an XML document, certain characters are necessarily sacrosanct: for example, the < and > characters that indicate element tags. When you need to include these in your text, you must encode them. XML provides an escape mechanism called "entities" that allows for encoding special structures. XML has five predefined entities, as shown in Table 24-1.

Table 24-1. XML entities

Entity	Encodes
`&amp;`	& (ampersand)
`&lt;`	< (less than)
`&gt;`	> (greater than)
`&quot;`	`"` (quotation mark)
`&apos;`	`'` (apostrophe)
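
If you are generating XML text by hand, you need to apply these substitutions yourself. Here is a minimal sketch of such a routine; the class XmlEscape and its escape() method are hypothetical names of our own, and a real application would more likely let an XML API do the encoding when it writes the document.

```java
// Hypothetical helper class, named for this example only.
class XmlEscape {

    // Replaces each of the five special characters with its predefined entity.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("if (a < b && b > c)"));
    }
}
```

Because the loop examines each input character exactly once, an ampersand that is already part of the output is never escaped a second time.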

An alternative to encoding text in this way is to use a special "unparsed" section of text called a character data (CDATA) section. A CDATA section starts with the cryptic string <![CDATA[ and ends with ]]>, like this:

 <![CDATA[ Learning Java, Oracle ]]>

The CDATA section looks a little like a comment, but the data is still part of the document, just opaque to the parser. In the Java 5.0 XML APIs, another alternative is possible. You can use a special <include> directive to include the contents of a URL or file either as pre-escaped text or as parsed XML. This new feature is very convenient and we'll talk about it later in this chapter.
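
To an application reading the document, a CDATA section and the equivalent entity-encoded text look the same. The following sketch assumes the standard DOM method getTextContent(); the class CDataDemo, the method rootText(), and the Formula element are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class CDataDemo {

    // Parses the text and returns the character content of the root element.
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // The same text written two ways: inside CDATA and with entities.
        String viaCData = rootText("<Formula><![CDATA[a < b & c]]></Formula>");
        String viaEntities = rootText("<Formula>a &lt; b &amp; c</Formula>");
        System.out.println(viaCData);
        System.out.println(viaCData.equals(viaEntities));
    }
}
```

Both forms should yield the same string, so the choice between them is a matter of convenience for whoever writes the XML.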

Namespaces

You've probably seen that HTML has a <body> tag that is used to structure web pages. Suppose for a moment that we are writing XML for a funeral home that also uses the tag <body> for some other, more macabre, purpose. This could be a problem if we want to mix HTML with our mortuary information. If you consider HTML and the funeral home tags to be languages in this case, the elements (tag names) used in a document are really the vocabulary of those languages. An XML namespace is a way of saying whose dictionary you are using for a given element, allowing us to mix them freely. (Later we'll talk about XML Schemas, which enforce the grammar and syntax of the language.) A namespace is specified with the xmlns attribute, whose value is a Uniform Resource Identifier (URI) that uniquely defines the set (and usually the meaning) of tags from that namespace:

 <element xmlns="namespaceURI">

Recall that a URI is not necessarily a URL. URIs are more general than URLs. In practical terms, a URI is to be treated as a unique string. Often, the URI is, in fact, also a URL for a document describing the namespace, but that is only by convention. An xmlns namespace attribute can be applied to an element and affects all its (nested) children; this is called a default namespace for the element:

 <body xmlns="http://funeral-procedures.org/">

But more often it is desirable to mix and match namespaces on a tag-by-tag basis. To do this, we can use the xmlns attribute to define a special identifier for the namespace and use that identifier as a prefix on the tags in question. For example:

 <funeral xmlns:fun="http://funeral-procedures.org/">
 <html><head></head><body></body></html>
 <fun:body>Corpse #42</fun:body>
 </funeral>

In the above snippet of XML, we've qualified the body tag with the prefix "fun:" that we defined in the <funeral> tag. In this case, we should also qualify the root tag, reflexively:

 <fun:funeral xmlns:fun="http://funeral-procedures.org/">

In the history of XML, support for namespaces is relatively new. Not all parsers support them. To accommodate this, the XML parser factories supplied with Java have a switch to specify whether you want a parser that understands namespaces. As of Java 5.0, this switch is still off by default, so you must turn it on explicitly:

 parserFactory.setNamespaceAware( true );

We'll talk more about parsing in the sections on SAX and DOM later in this chapter.
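
As a preview, here is a small sketch of what the namespace switch changes, assuming the standard DocumentBuilderFactory and DOM APIs. The class NamespaceDemo and its methods are hypothetical names for this illustration, and the sample document is a variation on the funeral snippet above with one unqualified body element added.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class NamespaceDemo {

    static final String FUNERAL_NS = "http://funeral-procedures.org/";

    // One unqualified <body> and one <fun:body> under a qualified root.
    static final String XML =
        "<fun:funeral xmlns:fun=\"" + FUNERAL_NS + "\">"
        + "<body>plain body</body>"
        + "<fun:body>Corpse #42</fun:body>"
        + "</fun:funeral>";

    static Document parse(String xml, boolean namespaceAware) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            return factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Namespace URI of the root element, or null if the parser resolved none.
    static String rootNamespace(String xml, boolean namespaceAware) {
        return parse(xml, namespaceAware).getDocumentElement().getNamespaceURI();
    }

    // Counts elements with the given local name in the funeral namespace.
    static int countFuneral(String xml, String localName) {
        return parse(xml, true)
            .getElementsByTagNameNS(FUNERAL_NS, localName).getLength();
    }

    public static void main(String[] args) {
        System.out.println(rootNamespace(XML, true));
        System.out.println(rootNamespace(XML, false));
        System.out.println(countFuneral(XML, "body"));
    }
}
```

With the switch on, the root element reports the funeral URI and only the qualified body element is counted. With the switch off, the parser treats "fun:funeral" as an ordinary name and reports no namespace at all.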

Validation

A document that conforms to the basic rules of XML, with proper encoding and balanced tags, is called a well-formed document. Just because a document is syntactically correct, however, doesn't mean that it makes sense. Two related sets of tools, DTDs and XML Schemas, define ways to provide a grammar for your XML elements. They allow you to create syntactic rules, such as "a City element can appear only once inside an Address element." XML Schema goes further to provide a flexible language for describing the validity of data content of the tags, including both simple and compound data types made of numbers and strings. XML Schema is a newer and far more complete solution (it includes data validation and not just rules about elements), but it is still not as widely used as the simpler DTDs. The standard (World Wide Web Consortium [W3C] approved) language for XML Schema is also fairly heavy and complex. For this reason, the Java validation APIs in Java 5.0 can support other schema languages for validating XML. We'll talk about the options later in this chapter. A document that is checked against a DTD or XML Schema description and follows the rules is called a valid document. A document can be well-formed without being valid, but not vice versa.
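
The difference between well-formed and valid can be sketched with the Address rule mentioned above. This example assumes the javax.xml.validation API introduced in Java 5.0 and the W3C XML Schema language; the schema itself, the Street element, and the class AddressValidation are our own inventions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
class AddressValidation {

    // Illustrative schema: an Address holds exactly one Street and one City
    // (elements in a sequence occur exactly once unless stated otherwise).
    static final String SCHEMA =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
        + "<xs:element name=\"Address\"><xs:complexType><xs:sequence>"
        + "<xs:element name=\"Street\" type=\"xs:string\"/>"
        + "<xs:element name=\"City\" type=\"xs:string\"/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    static Schema loadSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(SCHEMA)));
        } catch (SAXException e) {
            throw new IllegalStateException("the schema itself is broken", e);
        }
    }

    // Returns true if the document follows the rules in SCHEMA.
    static boolean isValid(String xml) {
        try {
            loadSchema().newValidator()
                .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed, or well-formed but invalid
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(
            "<Address><Street>1 Main St</Street><City>Springfield</City></Address>"));
        // Well-formed, but breaks the "only one City" rule.
        System.out.println(isValid(
            "<Address><Street>1 Main St</Street><City>A</City><City>B</City></Address>"));
    }
}
```

The second document has perfectly balanced tags, yet it should be rejected because the schema allows only a single City element.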

HTML to XHTML

To speak very loosely, we could say that the most popular and widely used form of XML in the world today is HTML. The terminology is loose because HTML is not really well-formed XML. HTML tags violate XML's rule forbidding unbalanced elements; the common <p> tag is typically used without a closing tag, for example. HTML attributes also don't require quotes. XML tags are also case-sensitive; <P> and <p> are two different tags in XML. We could generously say that HTML is "forgiving" with respect to details like this, but as a developer, you know that sloppy syntax results in ambiguity. XHTML is an alternate, strict XML version of HTML that is clear and unambiguous. This form of HTML works in most modern browsers. Fortunately, if you want to switch, you don't have to manually clean up all your HTML documents; Tidy (http://tidy.sf.net) is an open source program that automatically converts HTML to XHTML, validates it, and corrects common mistakes.