XML Review

XML is a format for storing structured data. Although it looks a lot like HTML, XML is much stricter about quotes, properly terminated tags, and other such details. XML does not define tag names, so document authors must invent their own set of tags or look to a standards organization that defines a suitable XML markup language. A markup language is essentially a set of custom tags with semantic meaning behind each tag; XSLT is one such markup language, since it is expressed using XML syntax.

The terms element and tag are often used interchangeably, and both are used in this tutorial. Technically speaking, element refers to the concept being modeled, while tag refers to the actual markup that appears in the XML document.

SGML, XML, and Markup Languages

Standard Generalized Markup Language (SGML) forms the basis for HTML, XHTML, XML, and XSLT, but in very different ways for each. Figure 1-2 illustrates the relationships between these technologies.

Figure 1-2. SGML heritage

SGML is a very sophisticated metalanguage designed for large and complex documentation. As a metalanguage, it defines syntax rules for tags but does not define any specific tags. HTML, on the other hand, is a specific markup language implemented using SGML; a markup language defines its own set of tags.

XML, as shown in Figure 1-2, is a subset of SGML. XML documents are compatible with SGML documents; however, XML is a much smaller language. A key goal of XML is simplicity, since it has to work well on the Web, where bandwidth and limited client processing power are concerns. Because of its simplicity, XML is easier to parse and validate, making it a better performer than SGML. XML is also a metalanguage, which explains why XML does not define any tags of its own. XSLT is a particular markup language implemented using XML, and will be covered in detail in the next two chapters. XHTML, like XSLT, is also an XML-based markup language.
XHTML is designed to be a replacement for HTML and is almost completely compatible with existing web browsers. Unlike HTML, however, XHTML is based strictly on XML, and the rules for well-formed documents are very clearly defined. This makes it much easier for vendors to build editors and development tools for XHTML, because the syntax is much more predictable and can be validated just like any other XML document. Many of the examples in this tutorial use XHTML instead of HTML, although XSLT can easily handle either format.
As we look at more advanced techniques for processing XML with XSLT, we will see that XML is not always dealt with in terms of a text file containing tags. From a certain perspective, XML files and their tags are really just a serialized representation of the underlying XML elements. This serialized form is good for storing XML data in files but may not be the most efficient format for exchanging data between systems or programmatically modifying the underlying data. For particularly large documents, a relational or object database offers far better scalability and performance than native XML text files.

XML Syntax

Example 1-1 shows a sample XML document that contains data about U.S. Presidents. This document is said to be well-formed because it adheres to several basic rules about proper XML formatting.

Example 1-1. presidents.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE presidents SYSTEM "presidents.dtd">
    <presidents>
      <president>
        <term from="1789" to="1797"/>
        <name>
          <first>George</first>
          <last>Washington</last>
        </name>
        <party>Federalist</party>
        <vicePresident>
          <name>
            <first>John</first>
            <last>Adams</last>
          </name>
        </vicePresident>
      </president>
      <president>
        <term from="1797" to="1801"/>
        <name>
          <first>John</first>
          <last>Adams</last>
        </name>
        <party>Federalist</party>
        <vicePresident>
          <name>
            <first>Thomas</first>
            <last>Jefferson</last>
          </name>
        </vicePresident>
      </president>
      <!-- remaining presidents omitted -->
    </presidents>

In HTML, a missing tag here and there or mismatched quotes are not disastrous. Browsers make every effort to go ahead and display these poorly formatted documents anyway. This makes the Web a much more enjoyable environment because users are not bombarded with constant syntax errors. Since the primary role of XML is to represent structured data, being well-formed is very important.
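The strictness of XML is easy to observe with a parser. The following is a minimal sketch, assuming only the JAXP classes bundled with the JDK; the class name WellFormedCheck and its isWellFormed method are hypothetical names chosen for this illustration. It parses a string and reports whether the parser accepted it as well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of the tutorial's own code.
class WellFormedCheck {

    /** Returns true if the parser accepts the text as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the document
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Properly nested and terminated
        System.out.println(isWellFormed(
                "<name><first>George</first><last>Washington</last></name>"));
        // End tag does not match the start tag
        System.out.println(isWellFormed("<name><first>George</last></name>"));
        // Tags differ only in capitalization
        System.out.println(isWellFormed("<Name>George</name>"));
    }
}
```

Unlike a browser displaying sloppy HTML, the parser does not attempt to recover: the second and third documents are rejected outright.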
When two banking systems exchange data, if the message is corrupted in any way, the receiving system must reject the message altogether or risk making the wrong assumptions. This is important for XSLT developers to understand because XSLT itself is expressed using XML. When writing stylesheets, you must always adhere to the basic rules for well-formed documents.

All well-formed XML documents must have exactly one root element. In Example 1-1, the root element is <presidents>. Every other element is nested inside the root, and each start tag is paired with a matching end tag:

    <name>
      <first>George</first>
      <last>Washington</last>
    </name>

Although whitespace (spaces, tabs, and linefeeds) between elements is typically irrelevant, it can make documents more readable if you take the time to indent consistently. Although XML parsers preserve whitespace, it does not affect the meaning of the underlying elements. In this example, the elements could just as well run together on a single line, beginning <name><first>George, without changing the data.

XML provides an alternate syntax for terminating elements that do not have children, formally known as empty elements. The <term> element is one such example:

    <term from="1797" to="1801"/>

The closing slash indicates that this element does not contain any content, although it may contain attributes. An attribute is a name/value pair, such as from="1797".

Most presidents had middle names, some did not have vice presidents, and others had several vice presidents. For our example XML file, these are known as optional elements. Ulysses Grant, for example, had two vice presidents. He also had a middle name:

    <president>
      <term from="1869" to="1877"/>
      <name>
        <first>Ulysses</first>
        ...

Capitalization is also important in XML. Unlike HTML, all XML tags are case sensitive. This means that <president> and <President> are not the same tag.

The following list summarizes the basic rules for a well-formed XML document:

- The document has exactly one root element.
- Every start tag has a matching end tag, or the element uses the empty-element syntax.
- Elements are properly nested inside one another.
- Attribute values are quoted.
- Tag names are case sensitive.
This is not the complete list of rules but is sufficient to get you through the examples in this tutorial. Clearly, most HTML documents are not well-formed. Many tags, such as <br> and <hr>, are routinely written without being terminated.

Validation

A well-formed XML document adheres to the basic syntax guidelines just outlined. A valid XML document goes one step further by adhering to either a Document Type Definition (DTD) or an XML Schema. In order to be considered valid, an XML document must first be well-formed. Stated simply, DTDs are the traditional approach to validation, and XML Schemas are the logical successor. XML Schema is another specification from the W3C and offers much more sophisticated validation capabilities than DTDs. Since XML Schema is very new, DTDs will continue to be used for quite some time. You can learn more about XML Schema at http://www.w3.org/XML/Schema.

The second line of Example 1-1 contains the following document type declaration:

    <!DOCTYPE presidents SYSTEM "presidents.dtd">

This refers to the DTD that exists in the same directory as the presidents.xml file. In many cases, the DTD will be referenced by a URI instead:

    <!DOCTYPE presidents SYSTEM "http://www.javaxslt.com/dtds/presidents.dtd">

Regardless of where the DTD is located, it contains rules that define the allowable structure of the XML data. Example 1-2 shows the DTD for our list of presidents.

Example 1-2.
presidents.dtd

    <!ELEMENT presidents (president+)>
    <!ELEMENT president (term, name, party, vicePresident*)>
    <!ELEMENT name (first, middle*, last, nickname?)>
    <!ELEMENT vicePresident (name)>
    <!ELEMENT first (#PCDATA)>
    <!ELEMENT last (#PCDATA)>
    <!ELEMENT middle (#PCDATA)>
    <!ELEMENT nickname (#PCDATA)>
    <!ELEMENT party (#PCDATA)>
    <!ELEMENT term EMPTY>
    <!ATTLIST term
      from CDATA #REQUIRED
      to   CDATA #REQUIRED
    >

The first line in the DTD says that the <presidents> element must contain one or more <president> elements. Each <president> must contain <term>, <name>, and <party>, in that order, followed by zero or more <vicePresident> elements. The <name> element requires <first> and <last>, allows any number of <middle> elements, and may end with an optional <nickname>:

    <name>
      <first>George</first>
      ...

Elements such as <first> contain only character data (#PCDATA), while <term> is declared EMPTY and carries two required attributes:

    <term from="1869" to="1877"/>

We will not cover the remaining syntax rules for DTDs in this tutorial, primarily because they do not have much impact on our code as we apply XSLT stylesheets. DTDs are primarily used during the parsing process, when XML data is read from a file into memory. When generating XML for a website, you generally produce new XML rather than parse existing XML, so there is much less need to validate. One area where we will use DTDs, however, is when we examine how to write unit tests for our Java and XSLT code. This will be covered in "Development Environment, Testing, and Performance".

Java and XML

Java APIs for XML such as SAX, DOM, and JDOM will be used throughout this tutorial. Although we will not go into a great deal of detail on specific parsing APIs, the Java-based XSLT tools do build on these technologies, so it is important to have a basic understanding of what each API does and where it fits into the XML landscape. For in-depth information on any of these topics, you might want to pick up a copy of Java & XML by Brett McLaughlin (Anonymous).

A parser is a tool that reads XML data into memory. The most common pattern is to parse the XML data from a text file, although Java XML parsers can also read XML from any Java input stream.

SAX

In the Java community, Simple API for XML (SAX) is the most commonly used XML parsing method today. SAX is a free API available from David Megginson and members of the XML-DEV mailing list (http://www.xml.org/xml-dev).
It can be downloaded from http://www.megginson.com/SAX. Although SAX has been ported to several other languages, we will focus on the Java features. SAX is only responsible for scanning through XML data top to bottom and sending event notifications as elements, text, and other items are encountered; it is up to the recipient of these events to process the data. Because SAX parsers do not store the entire document in memory, they have the potential to be very fast, even for huge files.
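The event-driven model described above can be sketched in a few lines. This is an illustrative example only, assuming the SAX 2 and JAXP classes bundled with the JDK; SaxEventSketch, RecordingHandler, and collectEvents are hypothetical names. The handler records each notification it receives, and text is buffered until the next tag because a parser is free to deliver character data in more than one piece.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class for illustration; not part of the tutorial's own code.
class SaxEventSketch {

    /** Records the order in which the parser reports elements and text. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> events = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            flushText();
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flushText();
            events.add("end:" + qName);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one run of text, so buffer it
            text.append(ch, start, length);
        }

        private void flushText() {
            String s = text.toString().trim();
            if (!s.isEmpty()) {
                events.add("text:" + s);
            }
            text.setLength(0);
        }
    }

    /** Parses the XML text and returns the recorded event list. */
    static List<String> collectEvents(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            RecordingHandler handler = new RecordingHandler();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<name><first>George</first><last>Washington</last></name>";
        for (String event : collectEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

Nothing here builds a tree: the handler sees each item once, in document order, and decides for itself what to keep.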
Currently, there are two versions of SAX: 1.0 and 2.0. Many changes were made in version 2.0, and the SAX examples in this tutorial use this version. Most SAX parsers should support the older 1.0 classes and interfaces; however, you will receive deprecation warnings from the Java compiler if you use these older features.

Java SAX parsers are implemented using a series of interfaces. The most important interface is org.xml.sax.ContentHandler, whose methods the parser calls as it encounters each part of the document. For an element such as:

    <first>George</first>

the parser calls startElement( ), then characters( ), then endElement( ).

NOTE: Depending on the SAX implementation, the characters( ) method may deliver contiguous text in several chunks.

Getting back to XSLT, you may be wondering where SAX fits into the picture. It turns out that XSLT processors typically have the ability to gather input from a series of SAX events as an alternative to static XML files. Somewhat nonintuitively, it also turns out that you can generate your own series of SAX events rather easily, without using a SAX parser. Since a SAX parser just calls a series of methods on the ContentHandler interface, your own code can make those same calls.

DOM

The Document Object Model (DOM) is an API that allows computer programs to manipulate the underlying data structure of an XML document. DOM is a W3C Recommendation, and implementations are available for many programming languages. The in-memory representation of XML is typically referred to as a DOM tree because DOM is a tree data structure. The root of the tree represents the XML document itself, using the org.w3c.dom.Document interface.

Strangely enough, the DOM Level 2 Recommendation does not provide standard mechanisms for reading or writing XML data. Instead, each vendor implementation does this a little bit differently. This is generally not a big problem because every DOM implementation out there provides some mechanism for both parsing and serializing, or writing out XML files. The unfortunate result, however, is that reading and writing XML will cause vendor-specific code to creep into any application you write.

NOTE: At the time of this writing, a new W3C document called "Document Object Model (DOM) Level 3 Content Models and Load and Save Specification" was in the working draft status.
Once this specification reaches the recommendation status, DOM will provide a standard mechanism for reading and writing XML.

Since DOM does not specify a standard way to read XML data into memory, most (if not all) DOM implementations delegate this task to a dedicated parser. In the case of Java, SAX is the preferred parsing technology. Figure 1-3 illustrates the typical interaction between SAX parsers and DOM implementations.

Figure 1-3. DOM and SAX interaction

Although it is important to understand how these pieces fit together, we will not go into detailed parsing syntax in this tutorial. As we progress to more sophisticated topics, we will almost always be generating XML dynamically rather than parsing in static XML data files. For this reason, let's look at how DOM can be used to generate a new document from scratch. Example 1-3 contains XML for a personal library.

Example 1-3. library.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE library SYSTEM "library.dtd">
    <library>
      <!-- This is an XML comment -->
      <publisher id="anonymous">
        <name>Anonymous</name>
        <street>1005 Gravenstein Hwy North</street>
        <city>Sebastopol</city>
        <state>CA</state>
        <postal>95472</postal>
      </publisher>
      <tutorial publisher="anonymous" CNPJ="1-56592-709-5">
        <version>1</version>
        <publicationDate mm="10" yy="1999"/>
        <title>XML Pocket Reference</title>
        <author>Robert Eckstein</author>
      </tutorial>
      <tutorial publisher="anonymous" CNPJ="0-596-00016-2">
        <version>1</version>
        <publicationDate mm="06" yy="2000"/>
        <title>Java and XML</title>
        <author>Brett McLaughlin</author>
      </tutorial>
    </library>

As shown in library.xml, a <library> contains <publisher> elements and <tutorial> elements. To generate this XML, three Java helper classes are used: Library, Publisher, and Tutorial. A portion of the Tutorial class looks like this:

    public class Tutorial {
        private String author;
        private String title;
        ...
        public String getAuthor( ) {
            return this.author;
        }

        public String getTitle( ) {
            return this.title;
        }
        ...
    }

Each of these three helper classes is merely used to hold data. The code that creates XML is encapsulated in a separate class called LibraryDOMCreator, shown in Example 1-4.

Example 1-4.
XML generation using DOM

    package chap1;

    import java.io.*;
    import java.util.*;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;

    /**
     * An example from "Introduction ". Creates the library XML file using the
     * DOM API.
     */
    public class LibraryDOMCreator {

        /**
         * Create a new DOM org.w3c.dom.Document object from the specified
         * Library object.
         *
         * @param library an application defined class that
         * provides a list of publishers and tutorials.
         * @return a new DOM document.
         */
        public Document createDocument(Library library)
                throws javax.xml.parsers.ParserConfigurationException {
            // Use Sun's Java API for XML Parsing to create the
            // DOM Document
            javax.xml.parsers.DocumentBuilderFactory dbf =
                    javax.xml.parsers.DocumentBuilderFactory.newInstance( );
            javax.xml.parsers.DocumentBuilder docBuilder =
                    dbf.newDocumentBuilder( );
            Document doc = docBuilder.newDocument( );

            // NOTE: DOM does not provide a factory method for creating:
            //   <!DOCTYPE library SYSTEM "library.dtd">
            // Apache's Xerces provides the createDocumentType method
            // on their DocumentImpl class for doing this. Not used here.

            // create the <library> document root element
            Element root = doc.createElement("library");
            doc.appendChild(root);

            // add <publisher> children to the <library> element
            Iterator publisherIter = library.getPublishers().iterator( );
            while (publisherIter.hasNext( )) {
                Publisher pub = (Publisher) publisherIter.next( );
                Element pubElem = createPublisherElement(doc, pub);
                root.appendChild(pubElem);
            }

            // now add <tutorial> children to the <library> element
            Iterator tutorialIter = library.getBooks().iterator( );
            while (tutorialIter.hasNext( )) {
                // cast to the declared type of the variable
                Tutorial tutorial = (Tutorial) tutorialIter.next( );
                Element tutorialElem = createBookElement(doc, tutorial);
                root.appendChild(tutorialElem);
            }

            return doc;
        }

        private Element createPublisherElement(Document doc, Publisher pub) {
            Element pubElem = doc.createElement("publisher");

            // set id="anonymous" attribute
            pubElem.setAttribute("id", pub.getId( ));

            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode(pub.getName( )));
            pubElem.appendChild(name);

            Element street = doc.createElement("street");
            street.appendChild(doc.createTextNode(pub.getStreet( )));
            pubElem.appendChild(street);

            Element city = doc.createElement("city");
            city.appendChild(doc.createTextNode(pub.getCity( )));
            pubElem.appendChild(city);

            Element state = doc.createElement("state");
            state.appendChild(doc.createTextNode(pub.getState( )));
            pubElem.appendChild(state);

            Element postal = doc.createElement("postal");
            postal.appendChild(doc.createTextNode(pub.getPostal( )));
            pubElem.appendChild(postal);

            return pubElem;
        }

        private Element createBookElement(Document doc, Tutorial tutorial) {
            Element tutorialElem = doc.createElement("tutorial");

            tutorialElem.setAttribute("publisher", tutorial.getPublisher().getId( ));
            tutorialElem.setAttribute("CNPJ", tutorial.getCNPJ( ));

            Element version = doc.createElement("version");
            version.appendChild(doc.createTextNode(
                    Integer.toString(tutorial.getversion( ))));
            tutorialElem.appendChild(version);

            Element publicationDate = doc.createElement("publicationDate");
            publicationDate.setAttribute("mm",
                    Integer.toString(tutorial.getPublicationMonth( )));
            publicationDate.setAttribute("yy",
                    Integer.toString(tutorial.getPublicationYear( )));
            tutorialElem.appendChild(publicationDate);

            Element title = doc.createElement("title");
            title.appendChild(doc.createTextNode(tutorial.getTitle( )));
            tutorialElem.appendChild(title);

            Element author = doc.createElement("author");
            author.appendChild(doc.createTextNode(tutorial.getAuthor( )));
            tutorialElem.appendChild(author);

            return tutorialElem;
        }

        public static void main(String[] args) throws IOException,
                javax.xml.parsers.ParserConfigurationException {
            Library lib = new Library( );
            LibraryDOMCreator ldc = new LibraryDOMCreator( );
            Document doc = ldc.createDocument(lib);

            // write the Document using Apache Xerces
            // output the Document with UTF-8 encoding; indent each line
            org.apache.xml.serialize.OutputFormat fmt =
                    new org.apache.xml.serialize.OutputFormat(doc, "UTF-8", true);
            org.apache.xml.serialize.XMLSerializer serial =
                    new org.apache.xml.serialize.XMLSerializer(System.out, fmt);
            serial.serialize(doc.getDocumentElement( ));
        }
    }

This example starts with the usual series of import statements. The workhorse of this class is the createDocument( ) method:

    public Document createDocument(Library library)
            throws javax.xml.parsers.ParserConfigurationException {

The method accepts a Library object and returns a DOM Document. The next step is to begin constructing a DOM Document object:

    javax.xml.parsers.DocumentBuilderFactory dbf =
            javax.xml.parsers.DocumentBuilderFactory.newInstance( );
    javax.xml.parsers.DocumentBuilder docBuilder =
            dbf.newDocumentBuilder( );
    Document doc = docBuilder.newDocument( );

This code relies on JAXP because the standard DOM API does not provide any support for creating a new Document object. JAXP provides a DocumentBuilderFactory, which creates a DocumentBuilder, which in turn creates the empty Document.

In DOM, new XML elements must always be created using factory methods, such as createElement( ), on the Document object:

    // create the <library> document root element
    Element root = doc.createElement("library");
    doc.appendChild(root);

At this point, the <library> element is empty, so the next step is to add its children:

    // add <publisher> children to the <library> element
    Iterator publisherIter = library.getPublishers().iterator( );
    while (publisherIter.hasNext( )) {
        Publisher pub = (Publisher) publisherIter.next( );

For each instance of Publisher, the createPublisherElement( ) method builds a <publisher> element. Each child element is created with the same three lines:

    Element name = doc.createElement("name");
    name.appendChild(doc.createTextNode(pub.getName( )));
    pubElem.appendChild(name);

The first line is pretty obvious, simply creating an empty <name> element. The second line wraps the publisher's name in a text node and appends it to <name>, and the third line adds <name> to the <publisher> element.

The main( ) method finishes by writing the document to System.out:

    // write the document using Apache Xerces
    // output the document with UTF-8 encoding; indent each line
    org.apache.xml.serialize.OutputFormat fmt =
            new org.apache.xml.serialize.OutputFormat(doc, "UTF-8", true);
    org.apache.xml.serialize.XMLSerializer serial =
            new org.apache.xml.serialize.XMLSerializer(System.out, fmt);
    serial.serialize(doc.getDocumentElement( ));

As we will see in "XSLT Processing with Java", JAXP 1.1 does provide a mechanism to perform this task using its transformation APIs, so we do not technically have to use the Xerces code listed here. The JAXP approach maximizes portability but introduces the overhead of an XSLT processor when all we really need is DOM.

JDOM

DOM is specified in the language-independent Common Object Request Broker Architecture Interface Definition Language (CORBA IDL), allowing the same interfaces and concepts to be utilized by many different programming languages. Though valuable from a specification perspective, this approach does not take advantage of specific Java language features. JDOM is a Java-only API that can be used to create and modify XML documents in a more natural way. By taking advantage of Java features, JDOM aims to simplify some of the more tedious aspects of DOM programming.

JDOM is not a W3C specification, but is open source software available at http://www.jdom.org. JDOM is great from a programming perspective because it results in much cleaner, more maintainable code. Since JDOM has the ability to convert its data into a standard DOM tree, it integrates nicely with any other XML tool.
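A standard DOM tree is the common currency between these tools. As a rough sketch of what that buys you, the following example hands a DOM Document to JAXP's identity Transformer, which is the portable serialization path mentioned at the end of the DOM discussion and the same path a JDOM document would take once converted to DOM. It assumes only classes bundled with the JDK; DomToStringSketch, buildSample, and serialize are hypothetical names for this illustration, and the tiny document stands in for the full library example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical class for illustration; not part of the tutorial's own code.
class DomToStringSketch {

    /** Builds a small DOM tree resembling a fragment of library.xml. */
    static Document buildSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("library");
            doc.appendChild(root);

            Element pub = doc.createElement("publisher");
            pub.setAttribute("id", "anonymous");
            root.appendChild(pub);

            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode("Anonymous"));
            pub.appendChild(name);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes any DOM Document without vendor-specific classes. */
    static String serialize(Document doc) {
        try {
            // A Transformer created without a stylesheet copies input to output
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildSample()));
    }
}
```

The trade-off is the one already noted: this route is portable across DOM implementations, but it starts up a transformation engine just to write out a tree.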
JDOM can also utilize whatever XML parser you specify and can write out XML to any Java output stream or file. It even features a class called
The code in Example 1-5 shows how much easier JDOM is than DOM; it does the same thing as the DOM example, but is about fifty lines shorter. This difference would be greater for more complex applications.

Example 1-5. XML generation using JDOM

    package com.anonymous.javaxslt.chap1;

    import java.io.*;
    import java.util.*;
    import org.jdom.DocType;
    import org.jdom.Document;
    import org.jdom.Element;
    import org.jdom.output.XMLOutputter;

    /**
     * An example from "Introduction ". Creates the library XML file.
     */
    public class LibraryJDOMCreator {

        public Document createDocument(Library library) {
            Element root = new Element("library");
            // JDOM supports the <!DOCTYPE...>
            DocType dt = new DocType("library", "library.dtd");
            Document doc = new Document(root, dt);

            // add <publisher> children to the <library> element
            Iterator publisherIter = library.getPublishers().iterator( );
            while (publisherIter.hasNext( )) {
                Publisher pub = (Publisher) publisherIter.next( );
                Element pubElem = createPublisherElement(pub);
                root.addContent(pubElem);
            }

            // now add <tutorial> children to the <library> element
            Iterator tutorialIter = library.getBooks().iterator( );
            while (tutorialIter.hasNext( )) {
                // cast to the declared type of the variable
                Tutorial tutorial = (Tutorial) tutorialIter.next( );
                Element tutorialElem = createBookElement(tutorial);
                root.addContent(tutorialElem);
            }

            return doc;
        }

        private Element createPublisherElement(Publisher pub) {
            Element pubElem = new Element("publisher");

            pubElem.addAttribute("id", pub.getId( ));
            pubElem.addContent(new Element("name").setText(pub.getName( )));
            pubElem.addContent(new Element("street").setText(pub.getStreet( )));
            pubElem.addContent(new Element("city").setText(pub.getCity( )));
            pubElem.addContent(new Element("state").setText(pub.getState( )));
            pubElem.addContent(new Element("postal").setText(pub.getPostal( )));

            return pubElem;
        }

        private Element createBookElement(Tutorial tutorial) {
            Element tutorialElem = new Element("tutorial");

            // add publisher="anonymous" and CNPJ="1234567" attributes
            // to the <tutorial> element
            tutorialElem.addAttribute("publisher", tutorial.getPublisher().getId( ))
                        .addAttribute("CNPJ", tutorial.getCNPJ( ));

            // now add a <version> element to <tutorial>
            tutorialElem.addContent(new Element("version").setText(
                    Integer.toString(tutorial.getversion( ))));

            Element pubDate = new Element("publicationDate");
            pubDate.addAttribute("mm",
                    Integer.toString(tutorial.getPublicationMonth( )));
            pubDate.addAttribute("yy",
                    Integer.toString(tutorial.getPublicationYear( )));
            tutorialElem.addContent(pubDate);

            tutorialElem.addContent(new Element("title").setText(tutorial.getTitle( )));
            tutorialElem.addContent(new Element("author").setText(tutorial.getAuthor( )));

            return tutorialElem;
        }

        public static void main(String[] args) throws IOException {
            Library lib = new Library( );
            LibraryJDOMCreator ljc = new LibraryJDOMCreator( );
            Document doc = ljc.createDocument(lib);

            // Write the XML to System.out, indent two spaces, include
            // newlines after each element
            new XMLOutputter("  ", true, "UTF-8").output(doc, System.out);
        }
    }

The JDOM example is structured just like the DOM example, beginning with a method that converts a Library object into a JDOM Document:

    public Document createDocument(Library library) {

The most striking difference in this particular method is the way in which the Document and its root Element are created:

    Element root = new Element("library");
    // JDOM supports the <!DOCTYPE...>
    DocType dt = new DocType("library", "library.dtd");
    Document doc = new Document(root, dt);

As this comment indicates, JDOM allows you to refer to a DTD, while DOM does not. This is just another odd limitation of DOM that forces you to include implementation-specific code in your Java applications. Another area where JDOM shines is in its ability to create new elements. Unlike DOM, text is set directly on the Element object rather than on a separate text node:

    private Element createPublisherElement(Publisher pub) {
        Element pubElem = new Element("publisher");

        pubElem.addAttribute("id", pub.getId( ));
        pubElem.addContent(new Element("name").setText(pub.getName( )));
        pubElem.addContent(new Element("street").setText(pub.getStreet( )));
        pubElem.addContent(new Element("city").setText(pub.getCity( )));
        pubElem.addContent(new Element("state").setText(pub.getState( )));
        pubElem.addContent(new Element("postal").setText(pub.getPostal( )));

        return pubElem;
    }

Since methods such as addContent( ) and setText( ) return the element they were called on, calls can be chained together, much as with StringBuffer:

    buf.append("a").append("b").append("c");

In an effort to keep the JDOM code more readable, however, our example adds one element per line.

The final piece of this pie is the ability to print out the contents of JDOM as an XML file. JDOM includes a class called XMLOutputter for this purpose:

    new XMLOutputter("  ", true, "UTF-8").output(doc, System.out);

The three arguments to the XMLOutputter constructor specify the indent string, whether to add newlines after each element, and the output encoding.

JDOM and DOM interoperability

Current XSLT processors are very flexible, generally supporting any of the following sources for XML or XSLT input:
JDOM is not directly supported by some XSLT processors, although this is changing fast. For this reason, it is typical to convert a JDOM Document into a DOM tree before handing it to the processor, which the DOMOutputter class does:
    org.jdom.output.DOMOutputter outputter =
            new org.jdom.output.DOMOutputter( );
    org.w3c.dom.Document domDoc = outputter.output(jdomDoc);