Reading the Header
HTTP servers provide a substantial amount of information in the header that precedes each response. For example, here's a typical HTTP header returned by an Apache web server:
HTTP/1.1 200 OK
Date: Mon, 18 Oct 1999 20:06:48 GMT
Server: Apache/1.3.4 (Unix) PHP/3.0.6 mod_perl/1.17
Last-Modified: Mon, 18 Oct 1999 12:58:21 GMT
ETag: "1e05f2-89bb-380b196d"
Accept-Ranges: bytes
Content-Length: 35259
Connection: close
Content-Type: text/html
There's a lot of information there. In general, an HTTP header may include the content type of the requested document, the length of the document in bytes, the character set in which the content is encoded, the date and time, the date the content expires, and the date the content was last modified. However, the information depends on the server; some servers send all this information for each request, others send some information, and a few don't send anything. The methods of this section allow you to query a URLConnection to find out what metadata the server has provided.

Aside from HTTP, very few protocols use MIME headers (and technically speaking, even the HTTP header isn't actually a MIME header; it just looks a lot like one). When writing your own subclass of URLConnection, it is often necessary to override these methods so that they return sensible values. The most important piece of information you may be lacking is the MIME content type. URLConnection provides some utility methods that guess the data's content type based on its filename or the first few bytes of the data itself.
Retrieving Specific Header Fields
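The first six methods request specific, particularly common fields from the header. These are getContentType(), getContentLength(), getContentEncoding(), getDate(), getExpiration(), and getLastModified().

public String getContentType()

This method returns the MIME content type of the data. It relies on the web server to send a valid content type. (In a later section, we'll see how recalcitrant servers are handled.) It throws no exceptions and returns null if the content type isn't available. text/html will be the most common content type you'll encounter when connecting to web servers. Other commonly used types include text/plain, image/gif, application/xml, and image/jpeg.

If the content type is some form of text, then this header may also contain a character set part identifying the document's character encoding. For example:

Content-type: text/html; charset=UTF-8

Or:

Content-Type: text/xml; charset=iso-2022-jp

In this case, getContentType() returns the full value of the Content-type field, including the character encoding. We can use this to improve on Example 15-1 by using the encoding specified in the HTTP header to decode the document, or ISO-8859-1 (the HTTP default) if no such encoding is specified. If a nontext type is encountered, an exception is thrown. Example 15-2 demonstrates: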
Example 15-2. Download a web page with the correct character set
import java.net.*;
import java.io.*;

public class EncodingAwareSourceViewer {

  public static void main(String[] args) {

    for (int i = 0; i < args.length; i++) {
      try {
        // set default encoding
        String encoding = "ISO-8859-1";
        URL u = new URL(args[i]);
        URLConnection uc = u.openConnection();
        String contentType = uc.getContentType();
        // getContentType() returns null if the server sent no content type
        if (contentType == null || !contentType.startsWith("text/")) {
          throw new IOException(args[i] + " is not a text document");
        }
        int encodingStart = contentType.indexOf("charset=");
        if (encodingStart != -1) {
          encoding = contentType.substring(encodingStart + 8);
        }
        InputStream in = new BufferedInputStream(uc.getInputStream());
        Reader r = new InputStreamReader(in, encoding);
        int c;
        while ((c = r.read()) != -1) {
          System.out.print((char) c);
        }
      }
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a parseable URL");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    } // end for

  } // end main

} // end EncodingAwareSourceViewer

In practice, most servers don't include charset information in their Content-type headers, so this is of limited use.
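When a charset parameter is present, taking everything after "charset=" also keeps any quotes or further parameters that follow the value. The following is a minimal sketch of a more defensive parser. The class name CharsetSniffer and the method charsetOf are hypothetical names chosen for this illustration, and the fallback to ISO-8859-1 simply mirrors the default used in Example 15-2.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of java.net.
class CharsetSniffer {

    // Returns the charset named in a Content-type value, or ISO-8859-1
    // if the value is null, has no charset parameter, or names a charset
    // this JVM does not recognize.
    static Charset charsetOf(String contentType) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        if (contentType == null) {
            return fallback;
        }
        // Parameters follow the media type, separated by semicolons,
        // e.g. "text/html; charset=UTF-8; other=value"
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            // Case-insensitive match on the parameter name
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim();
                // Strip optional surrounding double quotes
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException ex) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8"));
        System.out.println(charsetOf("text/xml; charset=\"utf-8\"; other=value"));
        System.out.println(charsetOf("image/gif"));
    }
}
```

The returned Charset can be passed directly to the InputStreamReader constructor in place of the encoding string.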
public int getContentLength()

The getContentLength() method tells you how many bytes there are in the content. Many servers send Content-length headers only when they're transferring a binary file, not when transferring a text file. If there is no Content-length header, getContentLength() returns -1. The method throws no exceptions. It is used when you need to know exactly how many bytes to read or when you need to create a buffer large enough to hold the data in advance.

Earlier, we discussed how to use the openStream() method of the URL class to download text files from an HTTP server. Although in theory you should be able to use the same method to download a binary file, such as a GIF image or a .class byte code file, in practice this procedure presents a problem. HTTP servers don't always close the connection exactly where the data is finished; therefore, you don't know when to stop reading. To download a binary file, it is more reliable to use a URLConnection's getContentLength() method to find the file's length, then read exactly the number of bytes indicated. Example 15-3 is a program that uses this technique to save a binary file on a disk.
Example 15-3. Downloading a binary file from a web site and saving it to disk
import java.net.*;
import java.io.*;

public class BinarySaver {

  public static void main(String[] args) {
    for (int i = 0; i < args.length; i++) {
      try {
        URL root = new URL(args[i]);
        saveBinaryFile(root);
      }
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand.");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
    } // end for
  } // end main

  public static void saveBinaryFile(URL u) throws IOException {

    URLConnection uc = u.openConnection();
    String contentType = uc.getContentType();
    int contentLength = uc.getContentLength();
    // a null content type means the server did not send one
    if (contentType == null || contentType.startsWith("text/") || contentLength == -1) {
      throw new IOException("This is not a binary file.");
    }

    InputStream raw = uc.getInputStream();
    InputStream in = new BufferedInputStream(raw);
    byte[] data = new byte[contentLength];
    int bytesRead = 0;
    int offset = 0;
    // a single read() may return fewer bytes than requested, so loop
    while (offset < contentLength) {
      bytesRead = in.read(data, offset, data.length - offset);
      if (bytesRead == -1) break;
      offset += bytesRead;
    }
    in.close();

    if (offset != contentLength) {
      throw new IOException("Only read " + offset + " bytes; Expected " + contentLength + " bytes");
    }

    String filename = u.getFile();
    filename = filename.substring(filename.lastIndexOf('/') + 1);
    FileOutputStream fout = new FileOutputStream(filename);
    fout.write(data);
    fout.flush();
    fout.close();

  }

} // end BinarySaver
As usual, the main() method loops over the URLs entered on the command line, passing each URL to the saveBinaryFile() method. saveBinaryFile() opens a URLConnection uc to the URL. It puts the type into the variable contentType and the content length into the variable contentLength. Next, an if statement checks whether the content type is text or the Content-length field is missing or invalid (contentLength == -1). If either of these is true, an IOException is thrown. If both are false, we have a binary file of known length: that's what we want.

Now that we have a genuine binary file on our hands, we prepare to read it into an array of bytes called data. data is initialized to the number of bytes required to hold the binary object, contentLength. Ideally, you would like to fill data with a single call to read(), but you probably won't get all the bytes at once, so the read is placed in a loop. The number of bytes read up to this point is accumulated into the offset variable, which also keeps track of the location in the data array at which to start placing the data retrieved by the next call to read(). The loop continues until offset equals or exceeds contentLength; that is, the array has been filled with the expected number of bytes. We also break out of the while loop if read() returns -1, indicating an unexpected end of stream. The offset variable now contains the total number of bytes read, which should be equal to the content length. If they are not equal, an error has occurred, so saveBinaryFile() throws an IOException. This is the general procedure for reading binary files from HTTP connections.

Now we are ready to save the data in a file. saveBinaryFile() gets the filename from the URL using the getFile() method and strips any path information by calling filename.substring(filename.lastIndexOf('/') + 1). A new FileOutputStream fout is opened into this file and the data is written in one large burst with fout.write(data).
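The read loop described above can be isolated into a small reusable method. This is only a sketch: the class name ExactReader and the method readExactly are hypothetical, and the demonstration reads from an in-memory stream rather than a network connection so that it runs without a server.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper illustrating the loop used in saveBinaryFile().
class ExactReader {

    // Reads exactly length bytes from in. A single read() call may
    // return fewer bytes than requested, so the call is repeated until
    // the array is full or the stream ends early.
    static byte[] readExactly(InputStream in, int length) throws IOException {
        byte[] data = new byte[length];
        int offset = 0;
        while (offset < length) {
            int bytesRead = in.read(data, offset, length - offset);
            if (bytesRead == -1) {
                throw new EOFException("Only read " + offset
                    + " bytes; expected " + length + " bytes");
            }
            offset += bytesRead;
        }
        return data;
    }

    public static void main(String[] args) throws IOException {
        byte[] source = {1, 2, 3, 4, 5};
        byte[] copy = readExactly(new ByteArrayInputStream(source), source.length);
        System.out.println(copy.length + " bytes read");
    }
}
```

With a real connection, the call would look like readExactly(uc.getInputStream(), uc.getContentLength()), after checking that the content length is not -1. The standard library's DataInputStream.readFully() performs a similar loop.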
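public String getContentEncoding()

This method returns a String that tells you how the content is encoded. If the content is sent unencoded (as is commonly the case with HTTP servers), this method returns null. It throws no exceptions. The most commonly used content encoding on the Web is probably x-gzip, which can be straightforwardly decoded using a java.util.zip.GZIPInputStream.

The content encoding is not the same as the character encoding. The character encoding is determined by the Content-type header or information internal to the document, and specifies how characters are represented as bytes. Content encoding specifies how the bytes are encoded in other bytes.

When subclassing URLConnection, override this method if you expect to be dealing with encoded data, as might be the case for an NNTP or SMTP protocol handler; in these applications, many different encoding schemes, such as BinHex and uuencode, are used to pass eight-bit binary data through a seven-bit ASCII connection.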
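As a rough illustration of acting on the content encoding, the sketch below wraps a stream in a GZIPInputStream when the encoding is gzip or x-gzip and otherwise passes it through. The class ContentDecoder and its methods are hypothetical names; the demonstration compresses a string in memory instead of contacting a server.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of java.net.
class ContentDecoder {

    // Returns a stream that undoes gzip content encoding when
    // contentEncoding names it; otherwise returns raw unchanged.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null) {
            String enc = contentEncoding.trim();
            if (enc.equalsIgnoreCase("gzip") || enc.equalsIgnoreCase("x-gzip")) {
                return new GZIPInputStream(raw);
            }
        }
        return raw;
    }

    // Reads a stream to its end and returns the bytes.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a gzip-compressed body in memory for the demonstration
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        }
        InputStream in = decode(new ByteArrayInputStream(compressed.toByteArray()), "x-gzip");
        System.out.println(new String(readAll(in), StandardCharsets.UTF_8));
    }
}
```

Against a live connection, the equivalent call would be decode(uc.getInputStream(), uc.getContentEncoding()).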
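public long getDate()

The getDate() method returns a long that tells you when the document was sent, in milliseconds since midnight, Greenwich Mean Time (GMT), January 1, 1970. You can convert it to a java.util.Date. For example:

Date documentSent = new Date(uc.getDate());

This is the time the document was sent as seen from the server; it may not agree with the time on your local machine. If the HTTP header does not include a Date field, getDate() returns 0.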
public long getExpiration()

Some documents have server-based expiration dates that indicate when the document should be deleted from the cache and reloaded from the server. getExpiration() is very similar to getDate(), differing only in how the return value is interpreted. It returns a long indicating the number of milliseconds after 12:00 A.M., GMT, January 1, 1970, at which point the document expires. If the HTTP header does not include an Expires field, getExpiration() returns 0, which means 12:00 A.M., GMT, January 1, 1970. The only reasonable interpretation of this date is that the document does not expire and can remain in the cache indefinitely.
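The interpretation of a zero expiration can be written down as a small check. This is a sketch of one possible cache policy, following the reading given above that 0 means the document does not expire; ExpirationCheck and isExpired are hypothetical names.

```java
// Hypothetical helper expressing the "0 means no expiration" reading.
class ExpirationCheck {

    // expiration is a value such as the one returned by getExpiration();
    // now is the current time in milliseconds since the epoch.
    static boolean isExpired(long expiration, long now) {
        if (expiration == 0) {
            // No expiration date was sent: treat the document as never expiring
            return false;
        }
        return now >= expiration;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(isExpired(0, now));          // no expiration date sent
        System.out.println(isExpired(now - 1000, now)); // expired one second ago
        System.out.println(isExpired(now + 1000, now)); // expires one second from now
    }
}
```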
public long getLastModified()

The final date method, getLastModified(), returns the date on which the document was last modified. Again, the date is given as the number of milliseconds since midnight, GMT, January 1, 1970. If the HTTP header does not include a Last-modified field (and many don't), this method returns 0.

Example 15-4 reads URLs from the command line and uses these six methods to print their content type, content length, content encoding, date of last modification, expiration date, and current date.
Example 15-4. Return the header
import java.net.*;
import java.io.*;
import java.util.*;

public class HeaderViewer {

  public static void main(String[] args) {

    for (int i = 0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        URLConnection uc = u.openConnection();
        System.out.println("Content-type: " + uc.getContentType());
        System.out.println("Content-encoding: " + uc.getContentEncoding());
        System.out.println("Date: " + new Date(uc.getDate()));
        System.out.println("Last modified: " + new Date(uc.getLastModified()));
        System.out.println("Expiration date: " + new Date(uc.getExpiration()));
        System.out.println("Content-length: " + uc.getContentLength());
      } // end try
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
      System.out.println();
    } // end for

  } // end main

} // end HeaderViewer
Here's the result when used to look at http://www.oracle.com:

% java HeaderViewer http://www.oracle.com
Content-type: text/html
Content-encoding: null
Date: Mon Oct 18 13:54:52 PDT 1999
Last modified: Sat Oct 16 07:54:02 PDT 1999
Expiration date: Wed Dec 31 16:00:00 PST 1969
Content-length: -1
The content type of the file at http://www.oracle.com is text/html. No content encoding was used. The file was sent on Monday, October 18, 1999 at 1:54 P.M., Pacific Daylight Time. It was last modified on Saturday, October 16, 1999 at 7:54 A.M. Pacific Daylight Time and it expires on Wednesday, December 31, 1969 at 4:00 P.M., Pacific Standard Time. Did this document really expire 31 years ago? No. Remember that what's being checked here is whether the copy in your cache is more recent than 4:00 P.M. PST, December 31, 1969. If it is, you don't need to reload it. More to the point, after adjusting for time zone differences, this date looks suspiciously like 12:00 A.M., Greenwich Mean Time, January 1, 1970, which happens to be the default if the server doesn't send an expiration date. (Most don't.)

Finally, the content length of -1 means that there was no Content-length header. Many servers don't bother to provide a Content-length header for text files. However, a Content-length header should always be sent for a binary file. Here's the HTTP header you get when you request the GIF image http://www.oracle.com/graphics/space.gif. Now the server sends a Content-length header with a value of 57.
% java HeaderViewer http://www.oracle.com/graphics/space.gif
Content-type: image/gif
Content-encoding: null
Date: Mon Oct 18 14:00:07 PDT 1999
Last modified: Thu Jan 09 12:05:11 PST 1997
Expiration date: Wed Dec 31 16:00:00 PST 1969
Content-length: 57
Retrieving Arbitrary Header Fields
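The last six methods requested specific fields from the header, but there's no theoretical limit to the number of header fields a message can contain. The next five methods inspect arbitrary fields in a header. Indeed, the methods of the last section are just thin wrappers over the methods discussed here; you can use these methods to get header fields that Java's designers did not plan for. If the requested header is found, it is returned. Otherwise, the method returns null.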
public String getHeaderField(String name)

The getHeaderField() method returns the value of a named header field. The name of the header is not case-sensitive and does not include a closing colon. For example, to get the value of the Content-type and Content-encoding header fields of a URLConnection object uc, you could write:

String contentType = uc.getHeaderField("content-type");
String contentEncoding = uc.getHeaderField("content-encoding");

To get the Date, Content-length, or Expires headers, you'd do the same:

String date = uc.getHeaderField("date");
String expires = uc.getHeaderField("expires");
String contentLength = uc.getHeaderField("Content-length");

These methods all return String, not int or long as the getContentLength(), getExpiration(), getLastModified(), and getDate() methods of the last section did. If you're interested in a numeric value, convert the String to a long or an int. Do not assume the value returned by getHeaderField() is valid. You must check to make sure it is non-null.
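Converting a header string to a number safely means handling both a missing header and a value that is not numeric. A minimal sketch, using the hypothetical names HeaderNumbers and parseLong with a caller-supplied fallback:

```java
// Hypothetical helper for converting header strings to numbers.
class HeaderNumbers {

    // Returns the header value as a long, or fallback if the value
    // is null (header absent) or is not a valid integer.
    static long parseLong(String value, long fallback) {
        if (value == null) {
            return fallback;
        }
        try {
            return Long.parseLong(value.trim());
        } catch (NumberFormatException ex) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseLong("35259", -1));
        System.out.println(parseLong(null, -1));
        System.out.println(parseLong("text/html", -1));
    }
}
```

A typical call would be parseLong(uc.getHeaderField("content-length"), -1).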
public String getHeaderFieldKey(int n)

This method returns the key (that is, the field name: for example, Content-length or Server) of the nth header field. The request method is header zero and has a null key. The first header is one. For example, to get the sixth key of the header of the URLConnection uc, you would write:

String header6 = uc.getHeaderFieldKey(6);
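public String getHeaderField(int n)

This method returns the value of the nth header field. In HTTP, the request method is header field zero and the first actual header is one. Example 15-5 uses this method in conjunction with getHeaderFieldKey() to print the entire HTTP header.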
Example 15-5. Print the entire HTTP header
import java.net.*;
import java.io.*;

public class AllHeaders {

  public static void main(String[] args) {

    for (int i = 0; i < args.length; i++) {
      try {
        URL u = new URL(args[i]);
        URLConnection uc = u.openConnection();
        // header zero is skipped; the first actual header is one
        for (int j = 1; ; j++) {
          String header = uc.getHeaderField(j);
          if (header == null) break;
          System.out.println(uc.getHeaderFieldKey(j) + ": " + header);
        } // end for
      } // end try
      catch (MalformedURLException ex) {
        System.err.println(args[i] + " is not a URL I understand.");
      }
      catch (IOException ex) {
        System.err.println(ex);
      }
      System.out.println();
    } // end for

  } // end main

} // end AllHeaders
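For example, here's the output when this program is run against http://www.oracle.com:

% java AllHeaders http://www.oracle.com
Server: WN/1.15.1
Date: Mon, 18 Oct 1999 21:20:26 GMT
Last-modified: Sat, 16 Oct 1999 14:54:02 GMT
Content-type: text/html
Title: www.oracle.com -- Welcome to Oracle & Associates! -- computer tutorials, software, online publishing
Link: <mailto:webmaster@bugmenot.com>; rev="Made"

Besides Date, Last-modified, and Content-type headers, this server also provides Server, Title, and Link headers. Other servers may have different sets of headers.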
public long getHeaderFieldDate(String name, long Default)

This method first retrieves the header field specified by the name argument and tries to convert the string to a long that specifies the milliseconds since midnight, January 1, 1970, GMT. getHeaderFieldDate() can be used to retrieve a header field that represents a date: for example, the Expires, Date, or Last-modified headers. To convert the string to an integer, getHeaderFieldDate() uses the parse() method of java.util.Date. The parse() method does a decent job of understanding and converting most common date formats, but it can be stumped, for instance, if you ask for a header field that contains something other than a date. If parse() doesn't understand the date or if getHeaderFieldDate() is unable to find the requested header field, getHeaderFieldDate() returns the Default argument. For example:

Date expires = new Date(uc.getHeaderFieldDate("expires", 0));
long lastModified = uc.getHeaderFieldDate("last-modified", 0);
Date now = new Date(uc.getHeaderFieldDate("date", 0));

You can use the methods of the java.util.Date class to convert the long to a String.
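The same parse-or-fall-back behavior can be sketched with the java.time classes available in newer Java versions. This is an alternative illustration, not how getHeaderFieldDate() itself is implemented; HttpDates and parseHttpDate are hypothetical names, and the sketch accepts only the RFC 1123 date format shown in the headers above.

```java
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, assuming Java 8 or later.
class HttpDates {

    // Converts an RFC 1123 date string such as
    // "Mon, 18 Oct 1999 20:06:48 GMT" to milliseconds since the epoch.
    // Returns fallback if the value is null or cannot be parsed.
    static long parseHttpDate(String value, long fallback) {
        if (value == null) {
            return fallback;
        }
        try {
            return ZonedDateTime
                .parse(value.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
        } catch (DateTimeParseException ex) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseHttpDate("Mon, 18 Oct 1999 20:06:48 GMT", 0));
        System.out.println(parseHttpDate("text/html", 0));
        System.out.println(parseHttpDate(null, 0));
    }
}
```

A call such as parseHttpDate(uc.getHeaderField("last-modified"), 0) would mirror the getHeaderFieldDate() fragment above.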
public int getHeaderFieldInt(String name, int Default)

This method retrieves the value of the header field name and tries to convert it to an int. If it fails, either because it can't find the requested header field or because that field does not contain a recognizable integer, getHeaderFieldInt() returns the Default argument. This method is often used to retrieve the Content-length field. For example, to get the content length from a URLConnection uc, you would write:

int contentLength = uc.getHeaderFieldInt("content-length", -1);

In this code fragment, getHeaderFieldInt() returns -1 if the Content-length header isn't present.
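Because the convenience methods are thin wrappers over getHeaderField(String), a subclass that overrides only that method gets working versions of the others. The sketch below demonstrates this with a stub connection that serves canned headers from a map and never touches the network. CannedConnection, putHeader, and the www.example.com URL are assumptions made for this illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical URLConnection subclass that returns canned headers.
class CannedConnection extends URLConnection {

    private final Map<String, String> headers = new HashMap<>();

    CannedConnection(URL url) {
        super(url);
    }

    // Stores a header under a lowercased name so lookups are
    // case-insensitive.
    void putHeader(String name, String value) {
        headers.put(name.toLowerCase(Locale.ROOT), value);
    }

    @Override
    public void connect() throws IOException {
        connected = true; // nothing to open in this stub
    }

    @Override
    public String getHeaderField(String name) {
        return headers.get(name.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws Exception {
        URL url = URI.create("http://www.example.com/").toURL();
        CannedConnection uc = new CannedConnection(url);
        uc.putHeader("Content-Length", "35259");
        uc.putHeader("Content-Type", "text/html");

        // Inherited methods that look up named header fields
        System.out.println(uc.getHeaderFieldInt("content-length", -1));
        System.out.println(uc.getHeaderFieldInt("expires", -1));
        System.out.println(uc.getContentType());
    }
}
```

A real protocol handler would populate the headers from the server's response in connect() rather than from putHeader() calls.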