Internationalization with XSLT

In this section, we explore the key techniques for internationalization (i18n) using XSLT. Although both Java and XSLT offer excellent support for i18n, pulling everything together into a working application is quite challenging. Hopefully this material will help to minimize some of the common obstacles.

XSLT Stylesheet Design

In its simplest form, i18n is accomplished by providing a separate XSLT stylesheet for each supported language. While this is easy to visualize, it results in far too much duplication of effort, because XSLT stylesheets typically contain some degree of programming logic in addition to pure display information. To illustrate this point, directory.xml is presented in Example 8-16. This is a very basic XML datafile that will be transformed using either English or Spanish XSLT stylesheets.

Example 8-16. directory.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <directory>
      <employee category="manager">
        <name>Joe Smith</name>
        <phone>4-0192</phone>
      </employee>
      <employee category="developer">
        <name>Sally Jones</name>
        <phone>4-2831</phone>
      </employee>
      <employee category="developer">
        <name>Roger Clark</name>
        <phone>4-3345</phone>
      </employee>
    </directory>

Figure 8-6 shows how an XSLT stylesheet transforms this XML into HTML, and Example 8-17 lists the XSLT stylesheet that produces this output.

Figure 8-6. English XSLT output

Example 8-17. directory_basic.xslt

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" encoding="UTF-8"/>
    <xsl:template match="/">
    <html>
    <head>

In this stylesheet, all locale-specific content is highlighted. This is information that must be changed to support a different language. As you can see, only a small portion of the XSLT is specific to the English language and is embedded directly within the stylesheet logic.
As written, the entire stylesheet must be rewritten to support another language. Fortunately, there is an easy solution to this problem. XSLT stylesheets can import other stylesheets; templates and variables in the importing stylesheet take precedence over conflicting items in the imported stylesheet. By isolating locale-specific content, we can use this import mechanism to support additional languages without duplicating the stylesheet logic.

Example 8-18. directory_en.xslt

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="html" encoding="UTF-8"/>

The XSLT stylesheet is now much more amenable to i18n. All locale-specific content is declared as a series of variables, so importing stylesheets can override them. The Spanish version of the stylesheet is shown in Example 8-19.

Example 8-19. directory_es.xslt

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

The Spanish stylesheet is much shorter because it merely overrides each of the locale-specific variables. It pulls in the English stylesheet with the following line:

    <xsl:import href="directory_en.xslt"/>

Because of XSLT conflict-resolution rules, the variables defined in directory_es.xslt take precedence over those defined in directory_en.xslt. The same logic can be applied to templates as well. This is useful in scenarios where the importing stylesheet needs to change behavior in addition to simply defining text translations. The following line is optional:

    <xsl:output method="html" encoding="UTF-8"/>

In this example, the output method and encoding are identical to the English version of the stylesheet, so this line has no effect. However, the importing stylesheet may specify a different output method and encoding if desired. To perform the Spanish transformation using Xalan, issue the following command:

    $ java org.apache.xalan.xslt.Process -IN directory.xml -XSL directory_es.xslt

Figure 8-7 shows the result of this transformation when displayed in a web browser.

Figure 8-7. Spanish output

NOTE: In the i18n example stylesheets presented in this chapter, common functionality is placed into one stylesheet. Importing stylesheets then replace locale-specific text. This same technique can be applied to any stylesheet and is particularly important when writing custom XSLT for a specific browser. Most of your code should be portable across a variety of browsers and should be placed into reusable stylesheets. The parts that change should be placed into browser-specific stylesheets that import the common stylesheets.

Encodings

A character encoding is a numeric representation of a particular character.[41] The US-ASCII encoding for the
The most comprehensive character encoding is called ISO/IEC 10646. This is also known as the Universal Character Set (UCS) and allocates a 32-bit number for each character. Although this allows UCS to uniquely identify every character in every language, it is not directly compatible with most computer software. Also, using 32 bits to represent each character results in a lot of wasted memory. Unicode is the official implementation of ISO/IEC 10646; its most frequently used characters fit in 16 bits, although the full code space is larger. You can learn more about Unicode at http://www.unicode.org.

UCS Transformation Formats (UTFs) are designed to support the UCS encoding while maintaining compatibility with existing computer software and encodings. UTF-8 and UTF-16 are the most common transformation formats, and all XML parsers and XSLT processors are required to support both.

If you deal mostly with English text, UTF-8 is the most efficient and easiest to use. Because the first 128 UTF-8 characters are the same as the US-ASCII characters, existing applications can utilize many UTF-8 files transparently. When additional characters are required, however, UTF-8 uses two or three bytes for each character in the 16-bit range and four bytes for characters beyond it. UTF-16 is more efficient than UTF-8 for Chinese, Japanese, and Korean (CJK) ideographs. When using UTF-16, most of these characters require two bytes, while they require three bytes under UTF-8 encoding.

Either UTF-8 or UTF-16 should work. However, it is wise to test actual transformations with both encodings to determine which results in the smallest file for your particular data. On a pragmatic note, many applications and operating systems, particularly Unix and Linux variants, offer better support for UTF-8 encoding.

As nearly every XSLT example in this book has shown, the <xsl:output> element specifies the encoding of the transformation result:

    <xsl:output method="html" encoding="UTF-16"/>

If this element is missing from the stylesheet, the XSLT processor is supposed to default to either UTF-8 or UTF-16 encoding.[42]
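The byte counts discussed above are easy to check directly. The following is a minimal sketch rather than part of the original examples: the class name EncodingSizes and its helper method are hypothetical, and the use of java.nio.charset.StandardCharsets assumes Java 7 or later. UTF-16BE is chosen so that no byte order mark is included in the count.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that compares encoded sizes in UTF-8 and UTF-16. */
public class EncodingSizes {

    /** Returns the number of bytes needed to encode the text in the given charset. */
    public static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII:
        // Latin capital A, n with tilde, and the CJK ideograph for "one".
        String[] samples = { "A", "\u00f1", "\u4e00" };
        for (String s : samples) {
            System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
                    + "  UTF-8 bytes: " + byteLength(s, StandardCharsets.UTF_8)
                    + "  UTF-16 bytes: " + byteLength(s, StandardCharsets.UTF_16BE));
        }
    }
}
```

For these three samples, the letter A takes one byte in UTF-8 and two in UTF-16, the Spanish character takes two bytes in each, and the ideograph takes three bytes in UTF-8 but only two in UTF-16. Running the same measurement over representative data is one way to decide which encoding produces the smaller file.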
Creating the XML and XSLT

The XML input data, XSLT stylesheet, and result tree do not have to use the same character encodings or language. For example, an XSLT stylesheet may itself be encoded in UTF-16, as declared on its first line, while specifying UTF-8 as its output encoding.
Even though the first line specifies UTF-16, it is important that the text editor used to create this stylesheet actually uses UTF-16 encoding when saving the file. Otherwise, tools such as XML Spy (http://www.xmlspy.com) may report errors as shown in Figure 8-8.

Figure 8-8. Error dialog

To further complicate matters, there are actually two variants of UTF-16. In UTF-16 Little Endian (UTF-16LE) encoding, the low byte of each two-byte character precedes the high byte. As expected, the high byte precedes the low byte in UTF-16 Big Endian (UTF-16BE) encoding. Fortunately, XML parsers can determine the encoding of a file by looking for a byte order mark. A UTF-16LE file should begin with the two bytes 0xFF 0xFE; a UTF-16BE file begins with 0xFE 0xFF.

For the upcoming Chinese example, the NJStar Chinese word processor (http://www.njstar.com) was used to input the Chinese characters. This is an example of an editor that has the ability to input ideographs and store files in various encodings. The Windows version of Notepad can save files in Unicode (UTF-16LE) format, and the Windows 2000 version of Notepad adds support for UTF-8 and UTF-16BE.

If all else fails, encoded text files can be created with Java using an OutputStreamWriter; the encoding name on the last line below is chosen only for illustration:

    FileOutputStream fos = new FileOutputStream("myFile.xml");
    // the OutputStreamWriter specifies the encoding of the file
    Writer out = new OutputStreamWriter(fos, "UTF-8"); // illustrative encoding name

Putting It All Together

Getting all of the pieces to work together is often the trickiest aspect of i18n. To demonstrate the concepts, we will now look at XML datafiles, XSLT stylesheets, and a servlet that work together to support any combination of English, Chinese, and Spanish. A basic HTML form makes it possible for users to select which XML file and XSLT stylesheet will be used to perform a transformation. The screen shot in Figure 8-9 shows what this web page looks like.

Figure 8-9. XML and XSLT language selection

As you can see, there are three versions of the XML data, one for each language.
Other than the language, the three files are identical. There are also three versions of the XSLT stylesheet, and the user can select any combination of XML and XSLT language. The character encoding for the resulting transformation is also configurable. UTF-8 and UTF-16 are compatible with Unicode and can display the Spanish and Chinese characters directly. ISO-8859-1, however, can represent characters outside its own range only by using character entities.

In this example, users explicitly specify their language preference. It is also possible to write a servlet that uses the Accept-Language HTTP request header, which browsers use to send a list of preferred languages such as:

    en, es, ja

From this list, the application can attempt to select the appropriate language and character encoding without prompting the user. Chapter 13 of Java Servlet Programming, Second Edition, by Jason Hunter (O'Reilly) presents a detailed discussion of this technique along with a class that supports it.

In Figure 8-10, the results of three different transformations are displayed. In the first window, a Chinese XSLT stylesheet is applied to a Chinese XML datafile. In the second window, the English version of the XSLT stylesheet is applied to the Spanish XML data. Finally, the Spanish XSLT stylesheet is applied to the Chinese XML data.

Figure 8-10. Several language combinations

The character encoding is generally transparent to the user. Switching to a different encoding makes no difference to the output displayed in Figure 8-10. However, it does make a difference when the page source is viewed. For example, when the output is UTF-8, the actual Chinese or Spanish characters are displayed in the source of the HTML page. When using ISO-8859-1, however, the source code looks something like this:

    <html>
    <head>
    <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

In the rest of the page source, the Chinese characters are replaced by their corresponding character entities.
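The substitution of character entities under ISO-8859-1 can be reproduced outside the servlet. The sketch below is an illustration under stated assumptions, not the chapter's own code: the class name Latin1OutputDemo is hypothetical, and it uses the JAXP identity transformer (a Transformer created without a stylesheet, which serializes as XML) rather than the HTML stylesheets of this example. The exact form of the character reference, decimal or hexadecimal, is left to the XSLT processor.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical demo: serialize the same XML with different output encodings. */
public class Latin1OutputDemo {

    /** Copies the XML through an identity transform, written in the given encoding. */
    public static String serialize(String xml, String encoding) {
        try {
            // newTransformer() with no arguments yields an identity transform
            Transformer trans = TransformerFactory.newInstance().newTransformer();
            trans.setOutputProperty(OutputKeys.ENCODING, encoding);

            // write bytes, as a servlet response ultimately does
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            trans.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(bytes));

            // decode with the same encoding so the result can be inspected
            return bytes.toString(encoding);
        } catch (TransformerException | UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u4e00 is the Chinese ideograph for "one"
        String xml = "<number english=\"one\">\u4e00</number>";
        System.out.println(serialize(xml, "UTF-8"));
        System.out.println(serialize(xml, "ISO-8859-1"));
    }
}
```

With UTF-8 the ideograph should appear in the output as itself. With ISO-8859-1 it cannot be represented, so the serializer is expected to write a numeric character reference in its place.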
XML data

Each of the three XML datafiles used by this example follows the format shown in Example 8-20. As you can see, the XML data merely lists translations from English to another language. All three files follow the same naming convention: numbers_english.xml, numbers_spanish.xml, and numbers_chinese.xml.

Example 8-20. numbers_spanish.xml

    <?xml version="1.0" encoding="UTF-8"?>
    <numbers>
      <language>Español (Spanish)</language>
      <number english="one">uno</number>
      <number english="two">dos</number>
      <number english="three">tres</number>
      <number english="four">cuatro</number>
      <number english="five">cinco</number>
      <number english="six">seis</number>
      <number english="seven">siete</number>
      <number english="eight">ocho</number>
      <number english="nine">nueve</number>
      <number english="ten">diez</number>
    </numbers>

XSLT stylesheets

The numbers_english.xslt stylesheet is shown in Example 8-21 and follows the same pattern that was introduced earlier in this chapter. Specifically, it isolates locale-specific data as a series of variables.

Example 8-21. numbers_english.xslt

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

As you can see, the default output encoding of this stylesheet is UTF-8. This can (and will) be overridden by the servlet, however. The Spanish stylesheet, numbers_spanish.xslt, is shown in Example 8-22.

Example 8-22. numbers_spanish.xslt

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

The Chinese stylesheet, numbers_chinese.xslt, is not listed here because it is structured exactly like the Spanish stylesheet. In both cases, numbers_english.xslt is imported, and the three variables are overridden with language-specific text.

Web page and servlet

The user begins with the web page that was shown in Figure 8-9. The HTML source for this page is listed in Example 8-23.
The language and encoding selections are posted to a servlet when the user clicks on the Submit button.

Example 8-23. i18n.html

    <html>
    <head>
    <title>Internationalization Demo</title>
    </head>
    <body>

The servlet, LanguageDemo.java, is shown in Example 8-24. This servlet accepts input from the i18n.html web page and then applies the XSLT transformation.

Example 8-24. LanguageDemo.java servlet

    package chap8;

    import java.io.*;
    import javax.servlet.*;
    import javax.servlet.http.*;
    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;

    /**
     * Allows any combination of English, Spanish, and Chinese XML
     * and XSLT.
     */
    public class LanguageDemo extends HttpServlet {
        public void doPost(HttpServletRequest req, HttpServletResponse res)
                throws ServletException, IOException {
            ServletContext ctx = getServletContext( );

            // these are all required parameters from the HTML form
            String xmlLang = req.getParameter("xmlLanguage");
            String xsltLang = req.getParameter("xsltLanguage");
            String charEnc = req.getParameter("charEnc");

            // convert to system-dependent path names

After getting the three request parameters for XML, XSLT, and encoding, the servlet converts the XML and XSLT names to actual filenames:

    String xmlFileName = ctx.getRealPath(
            "/WEB-INF/xml/numbers_" + xmlLang + ".xml");
    String xsltFileName = ctx.getRealPath(
            "/WEB-INF/xslt/numbers_" + xsltLang + ".xslt");

Because the XML files and XSLT stylesheets are named consistently, it is easy to determine the filenames. The next step is to set the content type of the response:

    // do this BEFORE calling HttpServletResponse.getWriter( )
    res.setContentType("text/html; charset=" + charEnc);

This is a critical step that instructs the servlet container to send the response to the client using the specified encoding type.
This gets inserted into the Content-Type HTTP response header, which will look like one of the following:

    Content-Type: text/html; charset=ISO-8859-1
    Content-Type: text/html; charset=UTF-8
    Content-Type: text/html; charset=UTF-16

Next, the servlet uses StreamSource objects to read the XML and XSLT files:

    Source xmlSource = new StreamSource(new File(xmlFileName));
    Source xsltSource = new StreamSource(new File(xsltFileName));
A StreamSource can also be built from a Reader, in which case the encoding is stated explicitly through an InputStreamReader:

    Source xmlSource = new StreamSource(new InputStreamReader(
            new FileInputStream(xmlFileName), "UTF-8"));

For more information on how Java uses encodings, see the relevant JavaDoc package description.

Our servlet then overrides the XSLT stylesheet's output encoding as follows:

    trans.setOutputProperty(OutputKeys.ENCODING, charEnc);

This takes precedence over the encoding that was specified in the stylesheet's <xsl:output> element.

Finally, the servlet performs the transformation, sending the result tree to the response writer:

    // note: res.getWriter( ) will use the encoding type that was
    // specified earlier in the call to res.setContentType( )
    trans.transform(xmlSource, new StreamResult(res.getWriter( )));

As the comment indicates, the servlet container should set up the writer returned by res.getWriter( ) to use the encoding that was passed to res.setContentType( ).
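The precedence of setOutputProperty( ) over the stylesheet's own output encoding can be observed in isolation. The following is a minimal sketch under stated assumptions: the class name OutputEncodingOverride and the tiny in-memory stylesheet are hypothetical, and the stylesheet uses method="xml" instead of the html method of Example 8-21 so that the effective encoding shows up in the XML declaration of the result.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical demo: override the output encoding declared by a stylesheet. */
public class OutputEncodingOverride {

    // Illustrative stylesheet whose <xsl:output> declares UTF-8
    private static final String XSLT =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" encoding=\"UTF-8\"/>"
        + "<xsl:template match=\"/\">"
        + "<result><xsl:value-of select=\"/numbers/language\"/></result>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /**
     * Transforms the XML; when charEnc is non-null it replaces the
     * encoding that the stylesheet declared.
     */
    public static String transform(String xml, String charEnc) {
        try {
            Transformer trans = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSLT)));
            if (charEnc != null) {
                // same call the servlet makes with the form's charEnc value
                trans.setOutputProperty(OutputKeys.ENCODING, charEnc);
            }
            StringWriter out = new StringWriter();
            trans.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<numbers><language>Spanish</language></numbers>";
        System.out.println(transform(xml, null));     // stylesheet's encoding
        System.out.println(transform(xml, "UTF-16")); // overridden encoding
    }
}
```

Because the result is written to a Writer here, only the declared encoding changes; the actual bytes are produced by whatever later encodes the characters. In the servlet that role belongs to the container, which is why the content type must be set before getWriter( ) is called.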
I18n Troubleshooting Checklist

Here are a few things to consider when problems occur. First, rule out obvious problems:
If these tests do not uncover the problem, try the following: