External Unparsed Entities and Notations

Not all data is XML. There are a lot of ASCII text files in the world that don't give two cents about escaping < as < or adhering to the other constraints by which an XML document is limited. There are probably even more JPEG photographs, GIF line art, QuickTime movies, MIDI sound files, and so on. None of these are well-formed XML, yet all of them are necessary components of many documents.

The mechanism that XML suggests for embedding these things in your documents is the external unparsed entity. The DTD specifies a name and a URI for the entity containing the non-XML data. For example, this ENTITY declaration associates the name turing_getting_off_bus with the JPEG image at http://www.turing.org.uk/turing/pi1/bus.jpg:

<!ENTITY turing_getting_off_bus SYSTEM "http://www.turing.org.uk/turing/pi1/bus.jpg" NDATA jpeg>

Notations

Since the data is not in XML format, the NDATA declaration specifies the type of the data. Here the name jpeg is used. XML does not recognize this as meaning an image in a format defined by the Joint Photographs Experts Group. Rather this is the name of a notation declared elsewhere in the DTD using a NOTATION declaration like this:

<!NOTATION jpeg SYSTEM "image/jpeg">

Here we've used the MIME media type image/jpeg as the external identifier for the notation. However, there is absolutely no standard or even a suggestion for exactly what this identifier should be. Individual applications must define their own requirements for the contents and meaning of notations.

Embedding Unparsed Entities in Documents

The DTD only declares the existence, location, and type of the unparsed entity. To actually include the entity in the document at one or more locations, you insert an element with an ENTITY type attribute whose value is the name of an unparsed entity declared in the DTD. You do not use an entity reference like &turing_getting_off_bus;. Entity references can only refer to parsed entities.

Suppose the image element and its source attribute are declared like this:

<!ELEMENT image EMPTY>
<!ATTLIST image source ENTITY #REQUIRED>

Then, this image element would refer to the photograph at http://www.turing.org.uk/turing/pi1/bus.jpg:

<image source="turing_getting_off_bus"/>

We should warn you that XML doesn't guarantee any particular behavior from an application that encounters this type of unparsed entity. It very well may not display the image to the user. Indeed, the parser may be running in an environment where there's no user to display the image to. It may not even understand that this is an image. The parser may not load or make any sort of connection with the server where the actual image resides. At most, it will tell the application on whose behalf it's parsing that there is an unparsed entity at a particular URI with a particular notation and let the application decide what, if anything, it wants to do with that information.

TIP: Unparsed general entities are not the only plausible way to embed non-XML content in XML documents. In particular, a simple URL, possibly associated with an XLink, does a fine job for many purposes, just as it does in HTML (which gets along just fine without any unparsed entities). Including all the necessary information in a single empty element like <image source = "http://www.turing.org.uk/turing/pi1/bus.jpg" /> is arguably preferable to splitting the same information between the element where it's used and the DTD of the document in which it's used. The only thing an unparsed entity really adds is the notation, but that's too nonstandard to be of much use.
In fact, many experienced XML developers, including the authors of this tutorial, feel strongly that unparsed entities are a complicated, confusing mistake that should never have been included in XML in the first place. Nonetheless, they are a part of the specification, so we describe them here.

Notations for Processing Instruction Targets

Notations can also be used to identify the exact target of a processing instruction. A processing instruction target must be an XML name, which means it can't be a full path like /usr/local/bin/tex. A notation can identify a short XML name like tex with a more complete specification of the external program for which the processing instruction is intended. For example, this notation associates the target tex with the more complete path /usr/local/bin/tex:

<!NOTATION tex SYSTEM "/usr/local/bin/tex">

In practice, this technique isn't much used or needed. Most applications that read XML files and pay attention to particular processing instructions simply recognize a particular target string like php or robots.