XML 1.0

It all begins with the XML 1.0 Recommendation, which you can read in its entirety at http://www.w3.org/TR/REC-xml. Example 1-1 shows an XML document that conforms to this specification. I'll use it to illustrate several important concepts.

Example 1-1. A typical XML document is long and verbose

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" xmlns:l="http://purl.org/rss/1.0/modules/link/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
 <!--Generated by Blogger v5.0-->
 <channel rdf:about="http://www.neilgaiman.com/journal/journal.asp">
 <title>Neil Gaiman's Journal</title>
 <link>http://www.neilgaiman.com/journal/journal.asp</link>
 <description>Neil Gaiman's Journal</description>
 <dc:date>2005-04-30T01:57:38Z</dc:date>
 <dc:language>en-US</dc:language>
 <admin:generatorAgent rdf:resource="http://www.blogger.com/" />
 <admin:errorReportsTo rdf:resource="mailto:rss-errors@bugmenot.com" />
 <items>
 <rdf:Seq>
 <rdf:li rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp" />
 <rdf:li rdf:resource="http://www.neilgaiman.com/journal/2005/04/jetlag-morning.asp" />
 <rdf:li rdf:resource="http://www.neilgaiman.com/journal/2005/04/demon-days.asp" />
 <rdf:li rdf:resource="http://www.neilgaiman.com/journal/2005/04/more-from-mailbag.asp" />
 <rdf:li rdf:resource="http://www.neilgaiman.com/journal/2005/04/two-days.asp" />
 <rdf:li rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp" />
 </rdf:Seq>
 </items>
 </channel>
 <!-- and so on... -->
</rdf:RDF>

For those of you who are curious, this is the RSS feed for Neil Gaiman's blog (http://www.neilgaiman.com). It uses a lot of RSS syntax, which I'll cover in detail.

A lot of this specification describes what is mostly intuitive. If you've done any HTML or SGML authoring, you're already familiar with the concept of elements (such as items and channel in Example 1-1) and attributes (such as rdf:resource and rdf:about). XML defines how to use these items and how a document must be structured. It spends more time defining tricky issues like whitespace than introducing concepts that you're not at least somewhat familiar with. One exception may be that some of the elements in Example 1-1 take the form:

[prefix]:[element name]

One example is rdf:li. These are elements in an XML namespace, something I'll explain in detail shortly.

An XML document can be broken into two basic pieces: the header (which the XML specification calls the prolog), which gives an XML parser and XML apps information about how to handle the document, and the content, which is the XML data itself. Although this is a fairly loose division, it helps us differentiate the instructions to apps within an XML document from the XML content itself, and is an important distinction to understand. The header begins with the XML declaration, in this format:

<?xml version="1.0" encoding="UTF-8"?>

This header includes an encoding, and can also indicate whether the document is a standalone document or requires other documents to be referenced for a complete understanding of its meaning:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

The rest of the header is made up of items like the DOCTYPE declaration (not included in the example):

<!DOCTYPE rdf:RDF SYSTEM "DTDs/RDF-gaiman.dtd">

In this case, the declaration refers to a file called RDF-gaiman.dtd, in the DTDs/ directory on the local system. Any time you use a relative or absolute file path or a URL, you want to use the SYSTEM keyword. The other option is using the PUBLIC keyword, and following it with a public identifier. This means that the W3C or another consortium has defined a standard DTD that is associated with that public identifier. As an example, take the DOCTYPE declaration for XHTML 1.0 Transitional:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Here, a public identifier is supplied (the funny little string starting with -//), followed by a system identifier (the URL). If the public identifier cannot be resolved, the system identifier is used instead.

You may also see processing instructions at the top of a file; they are generally considered part of a document's header, rather than its content. They look like this:

<?xml-stylesheet href="xsl/javaxml.html.xsl" type="text/xsl"?>
<?xml-stylesheet href="xsl/javaxml.wml.xsl" type="text/xsl" media="wap"?>
<?cocoon-process type="xslt"?>

Each is considered to have a target (the first word, like xml-stylesheet or cocoon-process) and data (the rest). Often, the data is in the form of name-value pairs, which can really help readability. This is only a good practice, though, and not required, so don't depend on it. Other than that, the bulk of your XML document should be content; in other words, elements, attributes, and data that you have put into it.
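
To make the target/data split concrete, here is a minimal sketch that reads the processing instructions from a document's header using the DOM parser that ships with the JDK. The class name, the method name, and the one-element root document are my own placeholders, not part of Example 1-1.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;
import org.xml.sax.InputSource;

// Hypothetical demo class; only the two processing instructions come from the text.
class ProcessingInstructionDemo {

    // A tiny stand-in document: two processing instructions, then an empty root element.
    static final String XML =
            "<?xml-stylesheet href='xsl/javaxml.html.xsl' type='text/xsl'?>"
          + "<?cocoon-process type='xslt'?>"
          + "<root/>";

    /** Maps each processing instruction's target to its data, in document order. */
    static Map<String, String> headerInstructions() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Map<String, String> result = new LinkedHashMap<>();
            // Processing instructions in the header are direct children of the Document node.
            for (Node n = doc.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    ProcessingInstruction pi = (ProcessingInstruction) n;
                    result.put(pi.getTarget(), pi.getData());
                }
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        headerInstructions().forEach((target, data) ->
                System.out.println(target + " -> " + data));
    }
}
```

Note that the parser hands back the data as one opaque string; splitting it into name-value pairs is left to the app, which is why you shouldn't depend on that format.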

The Root Element

The root element is the highest-level element in the XML document, and must be the first opening tag and the last closing tag within the document. It provides a reference point that enables an XML parser or XML-aware app to recognize a beginning and end to an XML document. In Example 1-1, the root element is rdf:RDF:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" xmlns:l="http://purl.org/rss/1.0/modules/link/" xmlns:content="http://purl.org/rss/1.0/modules/content/">
 <!-- Document content -->
</rdf:RDF>

This tag and its matching closing tag surround all other data content within the XML document. XML specifies that there may be only one root element in a document. In other words, the root element must enclose all other elements within the document. Aside from this requirement, a root element does not differ from any other XML element. It's important to understand this, because XML documents can reference and include other XML documents. In these cases, the root element of the referenced document becomes an enclosed element in the referring document and must be handled normally by an XML parser. Defining root elements as standard XML elements without special properties or behavior allows document inclusion to work seamlessly.

Elements

So far, I have glossed over defining an actual element. Let's take an in-depth look at elements, which are represented by arbitrary names and must be enclosed in angle brackets. There are several different variations of elements in the sample document, as shown here:

 <!-- Standard element opening tag -->
 <items>
 <!-- Standard element with attribute -->
 <rdf:li rdf:resource="http://www.neilgaiman.com/journal/2005/04/three-photographs.asp">
 <!-- Element with textual data -->
 <dc:creator>Neil Gaiman</dc:creator>
 <!-- Empty element -->
 <l:permalink l:type="text/html" rdf:resource="http://www.neilgaiman.com/journal/2005/04/finishing-things.asp" />
 <!-- Standard element closing tag -->
 </items>

This isn't actual XML; it's just a collection of examples. Trying to parse something like this would fail, as there are opening tags without corresponding closing tags.

The first rule in creating elements is that their names must start with a letter or underscore, and then may contain any number of letters, digits, underscores, hyphens, or periods. They may not contain embedded spaces:

<!-- Embedded spaces are not allowed -->
<my element name>

XML element names are also case-sensitive. Generally, using the same rules that govern Java variable naming will result in sound XML element naming. Using an element named tcbo to represent Telecommunications Business Object is not a good idea because it is cryptic, while an overly verbose tag name like beginningOfNewChapter just clutters up a document. Keep in mind that your XML documents will probably be seen by other developers and content authors, so clear documentation through good naming is essential.

Every opened element must in turn be closed. There are no exceptions to this rule, as there are in many other markup languages, like HTML. An ending element tag consists of a forward slash followed by the element name: </items>. Between an opening and closing tag, there can be any number of additional elements or textual data. However, you cannot mix the order of nested tags; the first opened element must always be the last closed element.

If any of the rules for XML syntax are not followed in an XML document, the document is not well-formed. A well-formed document is one in which all XML syntax rules are followed, and all elements and attributes are correctly positioned. However, a well-formed document is not necessarily valid; a valid document is one that also follows the constraints set upon it by its DTD or schema. There is a significant difference between the two: the rules I discuss in this section ensure that your document is well-formed, while the rules discussed in Chapter 2 ensure that your document is valid. As an example of a document that is not well-formed, consider this XML fragment:

<tag1>
 <tag2>
</tag1>
 </tag2>

The order of nesting of tags is incorrect, as the opened <tag2> is not followed by a closing </tag2> within the surrounding tag1 element. However, even if these syntax errors are corrected, there is still no guarantee that the document will be valid.

While this example of a document that is not well-formed may seem trivial, remember that browsers accept the equivalent in HTML, where it commonly occurs in large tables. In other words, HTML and many other markup languages do not require documents to be well-formed. XML's strict ordering and nesting rules let a parser process data without guessing at a document's structure, which is generally faster than handling markup languages without these constraints.

The last rule I'll look at is the case of empty elements. I already said that XML tags must always be paired; an opening tag and a closing tag constitute a complete XML element. There are cases where an element is used purely by itself, like a flag stating a chapter is incomplete, or where an element has attributes but no textual data, like an image declaration in HTML. These would have to be represented as:

<admin:generatorAgent rdf:resource="http://www.blogger.com/"></admin:generatorAgent>
<img src="/images/xml.gif"></img>

This is obviously a bit silly, and adds clutter to what can often be very large XML documents. The XML specification provides a means to signify both an opening and closing element tag within one element:

<admin:generatorAgent rdf:resource="http://www.blogger.com/" />
<img src="/images/xml.gif" />

This nicely solves the problem of unnecessary clutter, and still follows the rule that every XML element must have a matching end tag; it simply consolidates both start and end tag into a single tag.

What's with the Space Before the End Slash?

Well, let me tell you. I've had the unfortunate pleasure of working with Java and XML since late 1998, when things were rough at best. And some web browsers at that time (and some today, to be honest) would only accept XHTML (HTML that is well-formed) in very specific formats. Most notably, tags like <br> that are never closed in HTML must be closed in XHTML, resulting in <br/>. Some of these browsers would completely ignore a tag like this; however, oddly enough, they would happily process <br /> (note the space before the end slash). I got used to making my XML not only well-formed, but consumable by these browsers. I've never had a good reason to change these habits, so you get to see them in action here.
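
The well-formedness rules in this section are easy to check for yourself, because any XML parser refuses a document that breaks them. What follows is a minimal sketch, assuming the JAXP DOM parser bundled with the JDK; the class and method names are placeholders of my own, and the sample strings are simplified, unprefixed versions of the fragments shown above.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

    /** Returns true if the parser accepts the text as a well-formed XML document. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // A broken syntax rule surfaces as a SAXException (usually a SAXParseException).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<tag1><tag2></tag2></tag1>",            // properly nested
            "<tag1><tag2></tag1></tag2>",            // crossed closing tags
            "<my element name></my element name>",   // embedded spaces in the name
            "<Items></items>",                       // names are case-sensitive
            "<a/><b/>",                              // more than one root element
            "<img src=\"/images/xml.gif\"></img>",   // empty element, long form
            "<img src=\"/images/xml.gif\" />"        // empty element, short form
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

Only the first and the last two samples are accepted. Keep in mind that this checks well-formedness only; it says nothing about whether a document is valid against a DTD or schema.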

Attributes

In addition to text contained within an element's tags, an element can also have attributes. Attributes are included with their respective values within the element's opening tag (which, for an empty element, is also its closing tag!). For example, in the channel element, a URL for information about the channel is noted in an attribute:

<channel rdf:about="http://www.neilgaiman.com/journal/journal.asp">

In this example, rdf:about is the attribute name; the value is the URL, "http://www.neilgaiman.com/journal/journal.asp". Attribute names must follow the same rules as XML element names, and attribute values must be within quotation marks. Although both single and double quotes are allowed, double quotes are a widely used standard and result in XML documents that model Java coding practices.

In addition to learning how to use attributes, there is an issue of when to use attributes. Because XML allows such a variety of data formatting, it is rare that an attribute cannot be represented by an element, or that an element could not easily be converted to an attribute. Although there's no specification or widely accepted standard for determining when to use an attribute and when to use an element, there is a good rule of thumb: use elements for multiple-valued data and attributes for single-valued data.

If data can have multiple values, or is very lengthy, the data most likely belongs in an element. It can then be treated primarily as textual data, and is easily searchable and usable. Examples are the description of a tutorial's chapters, or URLs detailing related links from a site. However, if the data is primarily represented as a single value, it is best represented by an attribute. A good candidate for an attribute is the section of a chapter; while the section item itself might be an element and have its own title, the grouping of chapters within a section could be easily represented by a section attribute within the chapter element. This attribute would allow easy grouping and indexing of chapters, but would never be directly displayed to the user.

Another good example of a piece of data that could be represented in XML as an attribute is whether a particular table or chair is on layaway. This instruction could let an XML app used to generate a brochure or flyer know not to include items on layaway in current stock; obviously this is a true or false value, and has only a single value at any time. Again, the app client would never directly see this information, but the data would be used in processing and handling the XML document. If after all of this analysis you are still unsure, you can always play it safe and use an element.
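
The layaway case can be sketched in a few lines of Java. Everything here is made up for illustration: the inventory, item, and name elements, the layaway attribute, and the class itself. The point is only that the single-valued flag lives in an attribute and drives processing, while the displayable data lives in an element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical brochure generator input; none of these names come from Example 1-1.
class LayawayFilter {

    static final String XML =
            "<inventory>"
          + "<item layaway='true'><name>Oak table</name></item>"
          + "<item layaway='false'><name>Pine chair</name></item>"
          + "</inventory>";

    /** Returns the names of items that are not on layaway. */
    static List<String> inStock() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            List<String> names = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // The attribute is a processing flag; it is never shown to the reader.
                boolean onLayaway = Boolean.parseBoolean(item.getAttribute("layaway"));
                if (!onLayaway) {
                    // The element content is the data that actually gets displayed.
                    names.add(item.getElementsByTagName("name").item(0).getTextContent());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(inStock());
    }
}
```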

Namespaces

Note the use of namespaces in the root element of Example 1-1:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://purl.org/rss/1.0/" xmlns:admin="http://webns.net/mvcb/" xmlns:l="http://purl.org/rss/1.0/modules/link/" xmlns:content="http://purl.org/rss/1.0/modules/content/">

An XML namespace is a means of associating one or more elements in an XML document with a particular URI. This means that the element is identified by both its name and its namespace URI. In many complex XML documents, the same XML name (for example, author) may need to be used in different ways. For instance, a feed might have an author for the RSS feed itself, as well as an author for each journal entry. While both of these pieces of data fit nicely into an element named author, they should not be taken as the same type of data.

The XML namespaces specification nicely solves this problem. It requires that a unique URI be associated with a prefix to distinguish the elements in one namespace from elements in other namespaces. So you could assign a URI of http://www.neilgaiman.com/entries, and associate it with the prefix journal, for use by journal-specific elements. You could then assign another URI, like http://www.w3.org/1999/02/22-rdf-syntax-ns#, and a prefix of rss, for RSS-specific elements:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rss="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:journal="http://www.neilgaiman.com/entries">

Now you can use those prefixes in your XML:

<rss:author>Doug Hally</rss:author>
<journal:author>Neil Gaiman</journal:author>

You can actually use a namespace prefix on the same element where that namespace is declared. For example, this is perfectly legal XML:

<rss:author xmlns:rss="http://www.w3.org/1999/02/22-rdf-syntax-ns#">Doug Hally</rss:author>

An XML parser can now easily distinguish these two different types of author; as an added benefit, the XML is a lot more human-readable now.
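
Here is a minimal sketch of how a namespace-aware parser tells the two author elements apart, using the DOM parser bundled with the JDK. The feed wrapper element and the class and method names are placeholders of my own; the URIs and prefixes are the ones used in the text above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class; "feed" is a stand-in root element.
class NamespaceDemo {

    static final String RSS_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
    static final String JOURNAL_NS = "http://www.neilgaiman.com/entries";

    static final String XML =
            "<feed xmlns:rss='" + RSS_NS + "' xmlns:journal='" + JOURNAL_NS + "'>"
          + "<rss:author>Doug Hally</rss:author>"
          + "<journal:author>Neil Gaiman</journal:author>"
          + "</feed>";

    /** Returns the text of the first author element in the given namespace, or null. */
    static String authorIn(String namespaceUri) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // JAXP factories ignore namespaces unless asked not to.
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            // The lookup uses the URI and the local name; the prefix itself is irrelevant.
            NodeList matches = doc.getElementsByTagNameNS(namespaceUri, "author");
            return matches.getLength() == 0 ? null : matches.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("rss author:     " + authorIn(RSS_NS));
        System.out.println("journal author: " + authorIn(JOURNAL_NS));
    }
}
```

Because the match is made on the URI, a document that bound the same URI to a different prefix would give the same results.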

Entity References

One item I have not discussed is escaping characters, or referring to other constant-type data values. For example, a common way to represent a path to an installation directory in online documentation is <path-to-Ant> or <TOMCAT_HOME>. Here, the user would replace the text with the appropriate choice of installation directory. In the following journal entry, there are several HTML tags within the entry itself:

When the shoot was done, my daughter Holly, who had been doing her homework in the room next door, and occasionally coming out to laugh at me, helped use up the last few pictures on the roll. She looks like she's having fun. I think I look a little dazed.<br /><br /><img src="http://www.neilgaiman.com/journal/neil_8313036.jpg" /><br /><br />This is the one we're going to be using on the book jacket of ANANSI BOYS.

The problem is that XML parsers attempt to handle these bits of data (<br /> and <img>) as XML tags. This is a common problem, as any use of angle brackets results in this behavior. Entity references provide a way to overcome this problem. An entity reference is a special data type in XML used to refer to another piece of data. The entity reference consists of a unique name, preceded by an ampersand and followed by a semicolon: &[entity name];. When an XML parser sees an entity reference, the specified substitution value is inserted, and no processing of that value occurs. XML defines five entities to address the problem discussed in the example: &lt; for the less-than bracket, &gt; for the greater-than bracket, &amp; for the ampersand sign itself, &quot; for a double quotation mark, and &apos; for a single quotation mark or apostrophe. Using these special references, the entry can contain the HTML tags without having them interpreted as XML tags by the XML parser:

When the shoot was done, my daughter Holly, who had been doing her homework in the room next door, and occasionally coming out to laugh at me, helped use up the last few pictures on the roll. She looks like she's having fun. I think I look a little dazed.&lt;br /&gt;&lt;br /&gt;&lt;img src="http://www.neilgaiman.com/journal/neil_8313036.jpg" /&gt;&lt;br /&gt;&lt;br /&gt;This is the one we're going to be using on the book jacket of ANANSI BOYS.

Once this document is parsed, the app receives the entry as text with its literal <br /> and <img> tags restored, ready to be treated as normal HTML, and the document is still considered well-formed.

Also be aware that entity references are user-definable. This allows a sort of shortcut markup; for example, you might want to reference a copyright notice online somewhere. Because the copyright is used for multiple tutorials and articles, it doesn't make sense to include the actual text within hundreds of different XML documents; however, if the copyright is changed, all referring XML documents should reflect the changes:

<ora:copyright>&oracleCopyright;</ora:copyright>

Although you won't see how the XML parser is told what to reference when it sees &oracleCopyright; until the next chapter, you need to realize that there are more uses for entity references than just representing difficult or unusual characters within data.
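
To see the five predefined entities at work, here is a minimal sketch that parses a shortened version of the escaped journal entry with the JDK's DOM parser. The entry wrapper element and the class and method names are assumptions of mine, used only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo class; "entry" is a stand-in element name.
class EntityReferenceDemo {

    // The HTML tags are escaped, so the XML parser sees only character data.
    static final String XML =
            "<entry>I think I look a little dazed.&lt;br /&gt;&lt;br /&gt;"
          + "This is the one we&apos;re going to be using.</entry>";

    static Element parseEntry() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    /** The text the app receives, with entity references replaced by their characters. */
    static String entryText() {
        return parseEntry().getTextContent();
    }

    /** Number of elements nested inside entry; escaped tags do not count. */
    static int nestedElementCount() {
        return parseEntry().getElementsByTagName("*").getLength();
    }

    public static void main(String[] args) {
        System.out.println(entryText());
        System.out.println("nested elements: " + nestedElementCount());
    }
}
```

The parser reports no nested elements at all: the br tags reach the app purely as text, and it is up to the app to treat that text as HTML.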

Unparsed Data

The last XML construct to look at is the CDATA section marker. A CDATA section is used when a significant amount of data should be passed on to the calling app without any XML parsing. It is used when an unusually large number of characters would have to be escaped using entity references, or when spacing must be preserved. In an XML document, a CDATA section looks like this:

<content:encoded><![CDATA[Lot of flying yesterday and now I'm home again. For a day. Last night's useful post was written, but was eaten by weasels. Next week is the last week of <em>Beowulf-</em>with-Avary-and-Zemeckis work for a long while, and then I get to be home for about a month, if you don't count the trip to New York for Book Expo, and right now I just like the idea of sleeping in my own bed for a couple of nights running.
<br /><br /> </p>]]></content:encoded>

In this example, the information within the CDATA section does not have to use entity references or other mechanisms to alert the parser that reserved characters are being used; instead, the XML parser passes them unchanged to the wrapping program or app.

At this point, you have seen the major components of XML documents. Although each has only been briefly described, this should give you enough information to recognize the parts of an XML document when you see them and know their general purpose.
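
As a last minimal sketch, here is a shortened version of the CDATA example run through the JDK's DOM parser. The class and method names are my own; the element name and namespace URI come from Example 1-1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, for illustration only.
class CdataDemo {

    // Inside the CDATA section, angle brackets need no escaping,
    // and even the stray </p> does not upset the parser.
    static final String XML =
            "<content:encoded xmlns:content='http://purl.org/rss/1.0/modules/content/'>"
          + "<![CDATA[the last week of <em>Beowulf-</em>with-Avary-and-Zemeckis work"
          + "<br /><br /> </p>]]>"
          + "</content:encoded>";

    /** Returns the CDATA content exactly as the parser hands it to the app. */
    static String encodedContent() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the sample document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodedContent());
    }
}
```

The one thing a CDATA section cannot contain is the sequence ]]>, since that is what ends it.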