Guessing MIME Content Types
If this were the best of all possible worlds, every protocol and every server would use MIME types to specify the kind of file being transferred. Unfortunately, that's not the case. Not only do we have to deal with older protocols such as FTP that predate MIME, but many HTTP servers that should use MIME either don't provide MIME headers at all or provide headers that are incorrect (usually because the server has been misconfigured). The URLConnection class provides two static methods to help programs figure out the MIME type of some data; you can use these if the content type just isn't available or if you have reason to believe that the content type you're given isn't correct. The first of these is URLConnection.guessContentTypeFromName():
public static String guessContentTypeFromName(String name)[1]
[1] This method is protected in Java 1.3 and earlier, public in Java 1.4 and later.
This method tries to guess the content type of an object based upon the extension in the filename portion of the object's URL. It returns its best guess about the content type as a String. This guess is likely to be correct; people follow some fairly regular conventions when thinking up filenames. The guesses are determined by the content-types.properties file, normally located in the jre/lib directory. On Unix, Java may also look at the mailcap file to help it guess. Table 15-1 shows the guesses the JDK 1.5 makes. These vary a little from one version of the JDK to the next.
Table 15-1. Java extension content-type mappings
Extension | MIME content type
No extension, or unrecognized extension | content/unknown
.saveme, .dump, .hqx, .arc, .o, .a, .z, .bin, .exe, .zip, .gz | application/octet-stream
.oda | application/oda
.pdf | application/pdf
.eps, .ai, .ps | application/postscript
.dvi | application/x-dvi
.hdf | application/x-hdf
.latex | application/x-latex
.nc, .cdf | application/x-netcdf
.tex | application/x-tex
.texinfo, .texi | application/x-texinfo
.t, .tr, .roff | application/x-troff
.man | application/x-troff-man
.me | application/x-troff-me
.ms | application/x-troff-ms
.src, .wsrc | application/x-wais-source
.zip | application/zip
.bcpio | application/x-bcpio
.cpio | application/x-cpio
.gtar | application/x-gtar
.sh, .shar | application/x-shar
.sv4cpio | application/x-sv4cpio
.sv4crc | application/x-sv4crc
.tar | application/x-tar
.ustar | application/x-ustar
.snd, .au | audio/basic
.aifc, .aif, .aiff | audio/x-aiff
.wav | audio/x-wav
.gif | image/gif
.ief | image/ief
.jfif, .jfif-tbnl, .jpe, .jpg, .jpeg | image/jpeg
.tif, .tiff | image/tiff
.fpx, .fpix | image/vnd.fpx
.ras | image/x-cmu-rast
.pnm | image/x-portable-anymap
.pbm | image/x-portable-bitmap
.pgm | image/x-portable-graymap
.ppm | image/x-portable-pixmap
.rgb | image/x-rgb
.xbm, .xpm | image/x-xbitmap
.xwd | image/x-xwindowdump
.png | image/png
.htm, .html | text/html
.text, .c, .cc, .c++, .h, .pl, .txt, .java, .el | text/plain
.tsv | text/tab-separated-values
.etx | text/x-setext
.mpg, .mpe, .mpeg | video/mpeg
.mov, .qt | video/quicktime
.avi | application/x-troff-msvideo
.movie, .mv | video/x-sgi-movie
.mime | message/rfc822
.xml | application/xml
This list is not complete by any means. For instance, it omits various XML applications such as RDF (.rdf), XSL (.xsl), and so on that should have the MIME type application/xml. It also doesn't provide a MIME type for CSS stylesheets (.css). However, it's a good start. The second MIME type guesser method is URLConnection.guessContentTypeFromStream():
public static String guessContentTypeFromStream(InputStream in)
This method tries to guess the content type by looking at the first few bytes of data in the stream. For this method to work, the InputStream must support marking so that you can return to the beginning of the stream after the first bytes have been read. Java 1.5 inspects the first 11 bytes of the InputStream, although sometimes fewer bytes are needed to make an identification. Table 15-2 shows how Java 1.5 guesses. Note that these guesses are often not as reliable as the guesses made by the previous method. For example, an XML document that begins with a comment rather than an XML declaration would be mislabeled as an HTML file. This method should be used only as a last resort.
Table 15-2. Java first bytes content-type mappings
First bytes in hexadecimal | First bytes in ASCII | MIME content type
0xACED |  | application/x-java-serialized-object
0xCAFEBABE |  | application/java-vm
0x47494638 | GIF8 | image/gif
0x23646566 | #def | image/x-bitmap
0x2158504D32 | !XPM2 | image/x-pixmap
0x89504E470D0A1A0A |  | image/png
0x2E736E64 |  | audio/basic
0x646E732E |  | audio/basic
0x3C3F786D6C | <?xml | application/xml
0xFEFF003C003F00F7 |  | application/xml
0xFFFE3C003F00F700 |  | application/xml
0x3C21 | <! | text/html
0x3C68746D6C | <html | text/html
0x3C626F6479 | <body | text/html
0x3C68656164 | <head | text/html
0x3C48544D4C | <HTML | text/html
0x3C424F4459 | <BODY | text/html
0x3C48454144 | <HEAD | text/html
0xFFD8FFE0 |  | image/jpeg
0xFFD8FFEE |  | image/jpeg
0xFFD8FFE1XXXX4578696600[2] |  | image/jpeg
0x89504E470D0A1A0A |  | image/png
0x52494646 | RIFF | audio/x-wav
0xD0CF11E0A1B11AE1[3] |  | image/vnd.fpx
[2] The XX bytes are not checked. They can be anything.
[3] This actually just checks for a Microsoft structured storage document. Several other more complicated checks have to be made before deciding whether this is indeed an image/vnd.fpx document.
ASCII mappings, where they exist, are case-sensitive. For example, guessContentTypeFromStream() does not recognize <Html> as the beginning of a text/html file.
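The two guessers can be combined along the lines suggested above: try the filename first, and fall back to the leading bytes only as a last resort. What follows is a minimal sketch rather than a definitive implementation. The MimeGuesser class and its guess() method are hypothetical names introduced for illustration; only the two URLConnection methods come from the JDK. The sketch assumes the data is already in memory, so it can be wrapped in a ByteArrayInputStream, which supports marking. It also assumes that either null or content/unknown from the name guesser means "no guess," since what is returned for an unrecognized name can vary between JDK versions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class MimeGuesser {

    // Hypothetical helper, not part of java.net: guess from the filename
    // first, then fall back to the first bytes of the data.
    static String guess(String name, byte[] data) {
        String type = URLConnection.guessContentTypeFromName(name);
        // Assumption: treat both null and "content/unknown" as "no guess"
        if (type != null && !type.equals("content/unknown")) {
            return type;
        }
        // ByteArrayInputStream supports mark/reset, which the stream guesser requires
        InputStream in = new ByteArrayInputStream(data);
        try {
            // May still return null if the leading bytes are not recognized
            return URLConnection.guessContentTypeFromStream(in);
        } catch (IOException ex) {
            // The JDK method declares IOException; rethrow it unchecked
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        byte[] gif = "GIF89a".getBytes(StandardCharsets.US_ASCII);
        // Recognized extension: the filename alone settles it
        System.out.println(guess("index.html", new byte[0])); // text/html
        // No extension: the GIF8 signature in the data decides
        System.out.println(guess("download", gif));           // image/gif
    }
}
```

When reading from a network stream rather than a byte array, wrapping the stream in a BufferedInputStream before calling guessContentTypeFromStream() is one way to get the mark support the method needs.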