Guessing MIME Content Types

If this were the best of all possible worlds, every protocol and every server would use MIME types to specify the kind of file being transferred. Unfortunately, that's not the case. Not only do we have to deal with older protocols such as FTP that predate MIME, but many HTTP servers that should use MIME don't provide MIME headers at all or lie and provide headers that are incorrect (usually because the server has been misconfigured). The URLConnection class provides two static methods to help programs figure out the MIME type of some data; you can use these if the content type just isn't available or if you have reason to believe that the content type you're given isn't correct. The first of these is URLConnection.guessContentTypeFromName():

public static String guessContentTypeFromName(String name)^[1]

^[1] This method is protected in Java 1.3 and earlier, public in Java 1.4 and later.

This method tries to guess the content type of an object based upon the extension in the filename portion of the object's URL. It returns its best guess about the content type as a String. This guess is likely to be correct; people follow some fairly regular conventions when thinking up filenames. The guesses are determined by the content-types.properties file, normally located in the jre/lib directory. On Unix, Java may also look at the mailcap file to help it guess. Table 15-1 shows the guesses the JDK 1.5 makes. These vary a little from one version of the JDK to the next.
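As a rough illustration of the name-based lookup, the following sketch calls URLConnection.guessContentTypeFromName() on a few sample names. The class name NameGuessDemo, the helper guessOrDefault(), the fallback type, and the sample names are all assumptions made up for this example; only guessContentTypeFromName() itself comes from java.net. Depending on the JDK, an unrecognized extension may come back as null rather than content/unknown, so the helper treats both as "no guess."

```java
import java.net.URLConnection;

// Hypothetical demo class; not part of the JDK.
class NameGuessDemo {

    // Returns the name-based guess, or the supplied fallback when the
    // extension is missing or unrecognized.
    static String guessOrDefault(String name, String fallback) {
        String guess = URLConnection.guessContentTypeFromName(name);
        // Some JDKs return null here instead of "content/unknown".
        if (guess == null || guess.equals("content/unknown")) {
            return fallback;
        }
        return guess;
    }

    public static void main(String[] args) {
        // Sample names chosen for illustration only.
        String[] names = {
            "index.html",
            "logo.gif",
            "http://www.example.com/docs/report.pdf",
            "README"
        };
        for (String name : names) {
            System.out.println(name + " -> "
                + guessOrDefault(name, "application/octet-stream"));
        }
    }
}
```

The exact results depend on the content-types.properties file shipped with the JDK in use, so it is worth running this against the target runtime rather than relying on any one table.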

Table 15-1. Java extension content-type mappings

Extension	MIME content type
No extension, or unrecognized extension	`content/unknown`
.saveme, .dump, .hqx, .arc, .o, .a, .z, .bin, .exe, .zip, .gz	`application/octet-stream`
.oda	`application/oda`
.pdf	`application/pdf`
.eps, .ai, .ps	`application/postscript`
.dvi	`application/x-dvi`
.hdf	`application/x-hdf`
.latex	`application/x-latex`
.nc, .cdf	`application/x-netcdf`
.tex	`application/x-tex`
.texinfo, .texi	`application/x-texinfo`
.t, .tr, .roff	`application/x-troff`
.man	`application/x-troff-man`
.me	`application/x-troff-me`
.ms	`application/x-troff-ms`
.src, .wsrc	`application/x-wais-source`
.zip	`application/zip`
.bcpio	`application/x-bcpio`
.cpio	`application/x-cpio`
.gtar	`application/x-gtar`
.sh, .shar	`application/x-shar`
.sv4cpio	`application/x-sv4cpio`
.sv4crc	`application/x-sv4crc`
.tar	`application/x-tar`
.ustar	`application/x-ustar`
.snd, .au	`audio/basic`
.aifc, .aif, .aiff	`audio/x-aiff`
.wav	`audio/x-wav`
.gif	`image/gif`
.ief	`image/ief`
.jfif, .jfif-tbnl, .jpe, .jpg, .jpeg	`image/jpeg`
.tif, .tiff	`image/tiff`
.fpx, .fpix	`image/vnd.fpx`
.ras	`image/x-cmu-rast`
.pnm	`image/x-portable-anymap`
.pbm	`image/x-portable-bitmap`
.pgm	`image/x-portable-graymap`
.ppm	`image/x-portable-pixmap`
.rgb	`image/x-rgb`
.xbm, .xpm	`image/x-xbitmap`
.xwd	`image/x-xwindowdump`
.png	`image/png`
.htm, .html	`text/html`
.text, .c, .cc, .c++, .h, .pl, .txt, .java, .el	`text/plain`
.tsv	`text/tab-separated-values`
.etx	`text/x-setext`
.mpg, .mpe, .mpeg	`video/mpeg`
.mov, .qt	`video/quicktime`
.avi	`application/x-troff-msvideo`
.movie, .mv	`video/x-sgi-movie`
.mime	`message/rfc822`
.xml	`application/xml`

This list is not complete by any means. For instance, it omits various XML applications such as RDF (.rdf), XSL (.xsl), and so on that should have the MIME type application/xml. It also doesn't provide a MIME type for CSS stylesheets (.css). However, it's a good start. The second MIME type guesser method is URLConnection.guessContentTypeFromStream():

public static String guessContentTypeFromStream(InputStream in)

This method tries to guess the content type by looking at the first few bytes of data in the stream. For this method to work, the InputStream must support marking so that you can return to the beginning of the stream after the first bytes have been read. Java 1.5 inspects the first 11 bytes of the InputStream, although sometimes fewer bytes are needed to make an identification. Table 15-2 shows how Java 1.5 guesses. Note that these guesses are often not as reliable as the guesses made by the previous method. For example, an XML document that begins with a comment rather than an XML declaration would be mislabeled as an HTML file. This method should be used only as a last resort.
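The mark requirement and the "last resort" advice can be sketched roughly as follows. The class ContentTypeSniffer and its methods guess() and sniff() are hypothetical helpers invented for this example, as are the sample byte sequences; the only library calls are the two URLConnection guessers and standard java.io streams. The sketch assumes that a stream without mark support simply yields no guess, and it works around that by wrapping such a stream in a BufferedInputStream.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration; not part of java.net.
class ContentTypeSniffer {

    // Try the name first; inspect the stream's first bytes only if that fails.
    static String guess(String name, InputStream in) {
        String byName = URLConnection.guessContentTypeFromName(name);
        if (byName != null && !byName.equals("content/unknown")) {
            return byName;
        }
        return sniff(in);
    }

    // Guess from the leading bytes. A stream without mark support is wrapped
    // in a BufferedInputStream; a caller that still needs the data afterwards
    // should do that wrapping itself and keep reading from the wrapper.
    static String sniff(InputStream in) {
        InputStream markable = in.markSupported() ? in : new BufferedInputStream(in);
        try {
            return URLConnection.guessContentTypeFromStream(markable);
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        byte[] xml = "<!-- note --><root/>".getBytes(StandardCharsets.US_ASCII);

        // PushbackInputStream does not support mark/reset.
        InputStream unmarkable = new PushbackInputStream(new ByteArrayInputStream(gif));
        System.out.println(URLConnection.guessContentTypeFromStream(unmarkable));
        System.out.println(sniff(unmarkable));

        // An XML document that starts with a comment looks like HTML to the guesser.
        System.out.println(sniff(new ByteArrayInputStream(xml)));

        // No usable extension, so the bytes decide.
        System.out.println(guess("download", new ByteArrayInputStream(gif)));
    }
}
```

Trying the name before the bytes reflects the reliability ordering described above; a program that trusts server headers would normally check those before calling either guesser.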

Table 15-2. Java first bytes content-type mappings

First bytes in hexadecimal	First bytes in ASCII	MIME content type
0xACED		`application/x-java-serialized-object`
0xCAFEBABE		`application/java-vm`
0x47494638	GIF8	`image/gif`
0x23646566	#def	`image/x-bitmap`
0x2158504D32	!XPM2	`image/x-pixmap`
0x89504E470D0A1A0A		`image/png`
0x2E736E64		`audio/basic`
0x646E732E		`audio/basic`
0x3C3F786D6C	<?xml	`application/xml`
0xFEFF003C003F0078		`application/xml`
0xFFFE3C003F007800		`application/xml`
0x3C21	<!	`text/html`
0x3C68746D6C	<html	`text/html`
0x3C626F6479	<body	`text/html`
0x3C68656164	<head	`text/html`
0x3C48544D4C	<HTML	`text/html`
0x3C424F4459	<BODY	`text/html`
0x3C48454144	<HEAD	`text/html`
0xFFD8FFE0		`image/jpeg`
0xFFD8FFEE		`image/jpeg`
0xFFD8FFE1XXXX4578696600^[2]		`image/jpeg`
0x89504E470D0A1A0A		`image/png`
0x52494646	RIFF	`audio/x-wav`
0xD0CF11E0A1B11AE1^[3]		`image/vnd.fpx`

^[2] The XX bytes are not checked. They can be anything.

^[3] This actually just checks for a Microsoft structured storage document. Several other more complicated checks have to be made before deciding whether this is indeed an image/vnd.fpx document.

ASCII mappings, where they exist, are case-sensitive. For example, guessContentTypeFromStream() does not recognize <Html> as the beginning of a text/html file.
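The case sensitivity is easy to check with a small experiment. The sketch below feeds three spellings of an opening HTML tag to guessContentTypeFromStream(); the class name CaseSensitivityDemo, the helper sniff(), and the sample strings are assumptions made for this example. The mixed-case spelling is expected to produce no guess (null), though that is worth confirming on the JDK in use.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the JDK.
class CaseSensitivityDemo {

    // Encodes the text as ASCII and asks the JDK to guess from those bytes.
    // ByteArrayInputStream supports mark/reset, so no wrapper is needed.
    static String sniff(String start) {
        byte[] data = start.getBytes(StandardCharsets.US_ASCII);
        try {
            return URLConnection.guessContentTypeFromStream(new ByteArrayInputStream(data));
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        String[] samples = {"<html><head>", "<HTML><HEAD>", "<Html><Head>"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + sniff(sample));
        }
    }
}
```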