Several log formats have become standard, and we'll discuss some of the most common formats in this section. Most commercial and open source HTTP applications support logging in one or more of these common formats. Many of these applications also support the ability of administrators to configure log formats and create their own custom formats.

One of the main benefits of supporting (for applications) and using (for administrators) these standard formats is the ability to leverage the tools that have been built to process these logs and generate basic statistics from them. Many open source and commercial packages exist to crunch logs for reporting purposes, and by using standard formats, applications and their administrators can plug into these resources.

Common Log Format

One of the most common log formats in use today is called, appropriately, the Common Log Format. It was originally defined by NCSA, and many servers use it as their default log format. Most commercial and open source servers can be configured to use this format, and many commercial and freeware tools exist to help parse common log files. Table 21-1 lists, in order, the fields of the Common Log Format.

Table 21-1. Common Log Format fields

Field Description
remotehost The hostname or IP address of the requestor's machine (IP if the server was not configured to perform reverse DNS or cannot look up the requestor's hostname)
username If an ident lookup was performed, the requestor's authenticated username
auth-username If authentication was performed, the username with which the requestor authenticated
timestamp The date and time of the request
request-line The exact text of the HTTP request line (for example, "GET /index.html HTTP/1.1")
response-code The HTTP status code that was returned in the response
response-size The Content-Length of the response entity; if no entity was returned in the response, a zero is logged

RFC 931 describes the ident lookup used in this authentication. The ident protocol was discussed in Chapter 5.

Example 21-1 lists a few examples of Common Log Format entries.

Example 21-1. Common Log Format

209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024
http-guide.com - dg [03/Oct/1999:14:16:32 -0400] "GET / HTTP/1.0" 200 477
http-guide.com - dg [03/Oct/1999:14:16:32 -0400] "GET /foo HTTP/1.0" 404 0

In these examples, the fields are assigned as follows:

Field Entry 1 Entry 2 Entry 3
remotehost 209.1.32.44 http-guide.com http-guide.com
username <empty> <empty> <empty>
auth-username <empty> dg dg
timestamp 03/Oct/1999:14:16:00 -0400 03/Oct/1999:14:16:32 -0400 03/Oct/1999:14:16:32 -0400
request-line GET / HTTP/1.0 GET / HTTP/1.0 GET /foo HTTP/1.0
response-code 200 200 404
response-size 1024 477 0

Note that the remotehost field can be either a hostname, as in http-guide.com, or an IP address, such as 209.1.32.44.

The dashes in the second (username) and third (auth-username) fields indicate that the fields are empty. An empty second field means that an ident lookup did not occur; an empty third field means that authentication was not performed.
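The field layout in Table 21-1 maps naturally onto a regular expression. The following is a minimal sketch, not a definitive parser: the function name `parse_common_log_line` and the dictionary keys are illustrative choices, and the handling of a dash in the response-size field is an assumption about what some servers emit rather than something defined above.

```python
import re
from datetime import datetime

# One named group per field, in the order given in Table 21-1.
CLF_PATTERN = re.compile(
    r'^(?P<remotehost>\S+) '
    r'(?P<username>\S+) '
    r'(?P<auth_username>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request_line>[^"]*)" '
    r'(?P<response_code>\d{3}) '
    r'(?P<response_size>\d+|-)$'
)


def parse_common_log_line(line):
    """Return a dict of Common Log Format fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # A dash marks an empty field; represent it as None.
    for name in ("username", "auth_username"):
        if fields[name] == "-":
            fields[name] = None
    # Timestamps look like 03/Oct/1999:14:16:32 -0400.
    fields["timestamp"] = datetime.strptime(
        fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    fields["response_code"] = int(fields["response_code"])
    # Assumption: treat a dash in the size field the same as a logged zero.
    size = fields["response_size"]
    fields["response_size"] = 0 if size == "-" else int(size)
    return fields


# The third entry from Example 21-1.
entry = parse_common_log_line(
    'http-guide.com - dg [03/Oct/1999:14:16:32 -0400] "GET /foo HTTP/1.0" 404 0'
)
```

Returning None for a non-matching line, rather than raising, lets a caller count and skip malformed entries when scanning a large log file.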

Combined Log Format

Another commonly used log format is the Combined Log Format. This format is supported by servers such as Apache. The Combined Log Format is very similar to the Common Log Format; in fact, it mirrors it exactly, with the addition of two fields (listed in Table 21-2). The User-Agent field is useful in noting which HTTP client applications are making the logged requests, while the Referer field provides more detail about where the requestor found this URL.

Table 21-2. Additional Combined Log Format fields

Field Description
Referer The contents of the Referer HTTP header
User-Agent The contents of the User-Agent HTTP header

Example 21-2 gives an example of a Combined Log Format entry.

Example 21-2. Combined Log Format

209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024 "http://www.joes-hardware.com/" "5.0: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"

In Example 21-2, the Referer and User-Agent fields are assigned as follows:

Field Value
Referer http://www.joes-hardware.com/
User-Agent 5.0: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)

The first seven fields of the example Combined Log Format entry in Example 21-2 are exactly as they would be in the Common Log Format (see the first entry in Example 21-1). The two new fields, Referer and User-Agent, are tacked onto the end of the log entry.
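Because the two new fields are simply appended as quoted strings, a Common Log Format pattern can be extended with two more groups. The sketch below assumes that neither header value contains an embedded double quote; real servers may escape such characters, which this pattern does not handle. The name `parse_combined_log_line` is illustrative.

```python
import re

# The seven Common Log Format fields followed by the two quoted
# fields from Table 21-2.
COMBINED_PATTERN = re.compile(
    r'^(?P<remotehost>\S+) (?P<username>\S+) (?P<auth_username>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request_line>[^"]*)" '
    r'(?P<response_code>\d{3}) (?P<response_size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)


def parse_combined_log_line(line):
    """Return a dict of Combined Log Format fields (as strings), or None."""
    match = COMBINED_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# The entry from Example 21-2, written on a single line.
combined = parse_combined_log_line(
    '209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024 '
    '"http://www.joes-hardware.com/" '
    '"5.0: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"'
)
```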

Netscape Extended Log Format

When Netscape entered into the commercial HTTP application space, it defined for its servers many log formats that have been adopted by other HTTP application developers. Netscape's formats derive from the NCSA Common Log Format, but they extend that format to incorporate fields relevant to HTTP applications such as proxies and web caches.

The first seven fields in the Netscape Extended Log Format are identical to those in the Common Log Format (see Table 21-1). Table 21-3 lists, in order, the new fields that the Netscape Extended Log Format introduces.

Table 21-3. Additional Netscape Extended Log Format fields

Field Description
proxy-response-code If the transaction went through a proxy, the HTTP response code from the server to the proxy
proxy-response-size If the transaction went through a proxy, the Content-Length of the server's response entity sent to the proxy
client-request-size The Content-Length of any body or entity in the client's request to the proxy
proxy-request-size If the transaction went through a proxy, the Content-Length of any body or entity in the proxy's request to the server
client-request-hdr-size The length, in bytes, of the client's request headers
proxy-response-hdr-size If the transaction went through a proxy, the length, in bytes, of the proxy's response headers that were sent to the requestor
proxy-request-hdr-size If the transaction went through a proxy, the length, in bytes, of the proxy's request headers that were sent to the server
server-response-hdr-size The length, in bytes, of the server's response headers
proxy-timestamp If the transaction went through a proxy, the elapsed time for the request and response to travel through the proxy, in seconds

Example 21-3 gives an example of a Netscape Extended Log Format entry.

Example 21-3. Netscape Extended Log Format

209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024 200 1024 0 0 215 260 279 254 3

In this example, the extended fields are assigned as follows:

Field Value
proxy-response-code 200
proxy-response-size 1024
client-request-size 0
proxy-request-size 0
client-request-hdr-size 215
proxy-response-hdr-size 260
proxy-request-hdr-size 279
server-response-hdr-size 254
proxy-timestamp 3

The first seven fields of the example Netscape Extended Log Format entry in Example 21-3 mirror the entries in the Common Log Format example (see the first entry in Example 21-1).
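Since the nine extended fields are whitespace-separated numbers at the end of the entry, they can be pulled off the tail of the line without parsing the Common Log Format portion. This is a rough sketch: the function name is hypothetical, and mapping a dash to None is an assumption about how a field might appear when no proxy was involved.

```python
# Field names in the order given in Table 21-3.
EXTENDED_FIELDS = (
    "proxy_response_code",
    "proxy_response_size",
    "client_request_size",
    "proxy_request_size",
    "client_request_hdr_size",
    "proxy_response_hdr_size",
    "proxy_request_hdr_size",
    "server_response_hdr_size",
    "proxy_timestamp",
)


def parse_extended_fields(line):
    """Return the nine trailing Netscape Extended fields as a dict."""
    tokens = line.split()
    if len(tokens) < len(EXTENDED_FIELDS):
        raise ValueError("too few fields for a Netscape Extended entry")
    tail = tokens[-len(EXTENDED_FIELDS):]
    # Assumption: a dash means the value was not recorded.
    values = (None if token == "-" else int(token) for token in tail)
    return dict(zip(EXTENDED_FIELDS, values))


# A line modeled on Example 21-3, written on a single line.
extended = parse_extended_fields(
    '209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" '
    '200 1024 200 1024 0 0 215 260 279 254 3'
)
```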

Netscape Extended 2 Log Format

Another Netscape log format, the Netscape Extended 2 Log Format, takes the Extended Log Format and adds further information relevant to HTTP proxy and web caching applications. These extra fields help paint a better picture of the interactions between an HTTP client and an HTTP proxy application.

The Netscape Extended 2 Log Format derives from the Netscape Extended Log Format, and its initial fields are identical to those listed in Table 21-3 (it also extends the Common Log Format fields listed in Table 21-1).

Table 21-4 lists, in order, the additional fields of the Netscape Extended 2 Log Format.

Table 21-4. Additional Netscape Extended 2 Log Format fields

Field Description
route The route that the proxy used to make the request for the client (see Table 21-5)
client-finish-status-code The client finish status code; specifies whether the client request to the proxy completed successfully (FIN) or was interrupted (INTR)
proxy-finish-status-code The proxy finish status code; specifies whether the proxy request to the server completed successfully (FIN) or was interrupted (INTR)
cache-result-code The cache result code; tells how the cache responded to the request

Table 21-7 lists the Netscape cache result codes.

Example 21-4 gives an example of a Netscape Extended 2 Log Format entry.

Example 21-4. Netscape Extended 2 Log Format

209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024 200 1024 0 0 215 260 279 254 3 DIRECT FIN FIN WRITTEN

The extended fields in this example are assigned as follows:

Field Value
route DIRECT
client-finish-status-code FIN
proxy-finish-status-code FIN
cache-result-code WRITTEN

The first 16 fields in the Netscape Extended 2 Log Format entry in Example 21-4 mirror the entries in the Netscape Extended Log Format example (see Example 21-3).

Table 21-5 lists the valid Netscape route codes.

Table 21-5. Netscape route codes

Value Description
DIRECT The resource was fetched directly from the server.
PROXY(host:port) The resource was fetched through the proxy "host."
SOCKS(socks:port) The resource was fetched through the SOCKS server "socks."

Table 21-6 lists the valid Netscape finish codes.

Table 21-6. Netscape finish status codes

Value Description
- The request never even started.
FIN The request was completed successfully.
INTR The request was interrupted by the client or ended by a proxy/server.
TIMEOUT The request was timed out by the proxy/server.

Table 21-7 lists the valid Netscape cache codes.

Chapter 7 discusses HTTP caching in detail.

Table 21-7. Netscape cache codes

Code Description
- The resource was uncacheable.
WRITTEN The resource was written into the cache.
REFRESHED The resource was cached and it was refreshed.
NO-CHECK The cached resource was returned; no freshness check was done.
UP-TO-DATE The cached resource was returned; a freshness check was done.
HOST-NOT-AVAILABLE The cached resource was returned; no freshness check was done because the remote server was not available.
CL-MISMATCH The resource was not written to the cache; the write was aborted because the Content-Length did not match the resource size.
ERROR The resource was not written to the cache due to some error; for example, a timeout occurred or the client aborted the transaction.
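The four Netscape Extended 2 fields can be checked against the values in Tables 21-5 through 21-7. The sketch below validates the last four tokens of an entry and splits a PROXY or SOCKS route into host and port. The function name and the host `cache.example.com` are hypothetical, and rejecting unknown codes with an exception is one possible design, not a requirement of the format.

```python
import re

# DIRECT, PROXY(host:port), or SOCKS(socks:port), as in Table 21-5.
ROUTE_PATTERN = re.compile(
    r'^(?:DIRECT|(?P<kind>PROXY|SOCKS)\((?P<host>[^:()]+):(?P<port>\d+)\))$'
)
# Values from Table 21-6.
FINISH_CODES = {"-", "FIN", "INTR", "TIMEOUT"}
# Values from Table 21-7.
CACHE_CODES = {
    "-", "WRITTEN", "REFRESHED", "NO-CHECK", "UP-TO-DATE",
    "HOST-NOT-AVAILABLE", "CL-MISMATCH", "ERROR",
}


def parse_extended2_fields(line):
    """Return the four trailing Netscape Extended 2 fields as a dict."""
    tokens = line.split()
    if len(tokens) < 4:
        raise ValueError("too few fields for a Netscape Extended 2 entry")
    route, client_finish, proxy_finish, cache_result = tokens[-4:]
    match = ROUTE_PATTERN.match(route)
    if match is None:
        raise ValueError("unrecognized route: " + route)
    for code in (client_finish, proxy_finish):
        if code not in FINISH_CODES:
            raise ValueError("unrecognized finish status code: " + code)
    if cache_result not in CACHE_CODES:
        raise ValueError("unrecognized cache result code: " + cache_result)
    port = match.group("port")
    return {
        # The kind group is only set for PROXY and SOCKS routes.
        "route": match.group("kind") or "DIRECT",
        "route_host": match.group("host"),
        "route_port": int(port) if port is not None else None,
        "client_finish_status_code": client_finish,
        "proxy_finish_status_code": proxy_finish,
        "cache_result_code": cache_result,
    }


# The tail of Example 21-4, then a hypothetical entry routed through a proxy.
direct = parse_extended2_fields("279 254 3 DIRECT FIN FIN WRITTEN")
proxied = parse_extended2_fields("279 254 3 PROXY(cache.example.com:8080) FIN INTR -")
```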

Netscape applications, like many other HTTP applications, have other log formats too, including a Flexible Log Format and a means for administrators to output custom log fields. These formats allow administrators greater control and the ability to customize their logs by choosing which parts of the HTTP transaction (headers, status, sizes, etc.) to report in their logs.

The ability for administrators to configure custom formats was added because it is difficult to predict what information administrators will be interested in getting from their logs. Many other proxies and servers also have the ability to emit custom logs.

Squid Proxy Log Format

The Squid proxy cache (http://www.squid-cache.org) is a venerable part of the Web. Its roots trace back to one of the early web proxy cache projects (ftp://ftp.cs.colorado.edu/pub/techreports/schwartz/Harvest.Conf.ps.Z). Squid is an open source project that has been extended and enhanced by the open source community over the years. Many tools have been written to help administer the Squid application, including tools to help process, audit, and mine its logs. Many subsequent proxy caches adopted the Squid format for their own logs so that they could leverage these tools.

The format of a Squid log entry is fairly simple. Its fields are summarized in Table 21-8.

Table 21-8. Squid Log Format fields

Field Description
timestamp The timestamp when the request arrived, in seconds since January 1, 1970 GMT.
time-elapsed The elapsed time for request and response to travel through the proxy, in milliseconds.
host-ip The IP address of the client's (requestor's) host machine.
result-code/status The result-code field is a Squid-ism that tells what action the proxy took during this request; the status field is the HTTP response code that the proxy sent to the client.
size The length of the proxy's response to the client, including HTTP response headers and body, in bytes.
method The HTTP method of the client's request.
url The URL in the client's request.
rfc931-ident The client's authenticated username.
hierarchy/from Like the route field in Netscape formats, the hierarchy field tells what route the proxy used to make the request for the client. The from field tells the name of the server that the proxy used to make the request.
content-type The Content-Type of the proxy response entity.

Table 21-9 lists the various result codes and their meanings.

Recall from Chapter 2 that proxies often log the entire requested URL, so if a username and password component are in the URL, a proxy can inadvertently record this information.

The rfc931-ident, hierarchy/from, and content-type fields were added in Squid 1.1. Previous versions did not have these fields.

RFC 931 describes the ident lookup used in this authentication.

http://squid.nlanr.net/Doc/FAQ/FAQ-6.html#ss6.6 lists all of the valid Squid hierarchy codes.

Example 21-5 gives an example of a Squid Log Format entry.

Example 21-5. Squid Log Format

99823414 3001 209.1.32.44 TCP_MISS/200 4087 GET http://www.joes-hardware.com - DIRECT/proxy.com text/html

The fields are assigned as follows:

Field Value
timestamp 99823414
time-elapsed 3001
host-ip 209.1.32.44
result-code TCP_MISS
status 200
size 4087
method GET
url http://www.joes-hardware.com
rfc931-ident -
hierarchy DIRECT
from proxy.com
content-type text/html

The DIRECT Squid hierarchy value is the same as the DIRECT route value in Netscape log formats.
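A Squid entry with the ten fields of Table 21-8 can be split on whitespace, with the two slash-separated fields split once more. The following is a sketch under those assumptions: the function name and dictionary keys are illustrative, and entries from Squid versions before 1.1 (which lack the last three fields) would be rejected.

```python
from datetime import datetime, timezone


def parse_squid_log_line(line):
    """Return a dict of the Squid Log Format fields listed in Table 21-8."""
    parts = line.split()
    if len(parts) != 10:
        raise ValueError("expected 10 fields, found %d" % len(parts))
    (timestamp, elapsed, host_ip, result_status, size,
     method, url, ident, hierarchy_from, content_type) = parts
    # result-code/status and hierarchy/from each hold two values.
    result_code, status = result_status.split("/", 1)
    hierarchy, from_host = hierarchy_from.split("/", 1)
    return {
        # Seconds since January 1, 1970 GMT.
        "timestamp": datetime.fromtimestamp(float(timestamp), tz=timezone.utc),
        "time_elapsed_ms": int(elapsed),
        "host_ip": host_ip,
        "result_code": result_code,
        "status": int(status),
        "size": int(size),
        "method": method,
        "url": url,
        # A dash marks an empty ident field.
        "rfc931_ident": None if ident == "-" else ident,
        "hierarchy": hierarchy,
        "from": from_host,
        "content_type": content_type,
    }


# A line modeled on Example 21-5, written on a single line.
squid_entry = parse_squid_log_line(
    "99823414 3001 209.1.32.44 TCP_MISS/200 4087 GET "
    "http://www.joes-hardware.com - DIRECT/proxy.com text/html"
)
```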

Table 21-9 lists the various Squid result codes.

Several of these result codes deal more with the internals of the Squid proxy cache, so not all of them are used by other proxies that implement the Squid Log Format.

Table 21-9. Squid result codes

Action Description
TCP_HIT A valid copy of the resource was served out of the cache.
TCP_MISS The resource was not in the cache.
TCP_REFRESH_HIT The resource was in the cache but needed to be checked for freshness. The proxy revalidated the resource with the server and found that the in-cache copy was indeed still fresh.
TCP_REF_FAIL_HIT The resource was in the cache but needed to be checked for freshness. However, the revalidation failed (perhaps the proxy could not connect to the server), so the "stale" resource was returned.
TCP_REFRESH_MISS The resource was in the cache but needed to be checked for freshness. Upon checking with the server, the proxy learned that the resource in the cache was out of date and received a new version.
TCP_CLIENT_REFRESH_MISS The requestor sent a Pragma: no-cache or similar Cache-Control directive, so the proxy was forced to fetch the resource.
TCP_IMS_HIT The requestor issued a conditional request, which was validated against the cached copy of the resource.
TCP_SWAPFAIL_MISS The proxy thought the resource was in the cache but for some reason could not access it.
TCP_NEGATIVE_HIT A cached response was returned, but the response was a negatively cached response. Squid supports the notion of caching errors for resources (for example, caching a 404 Not Found response), so if multiple requests go through the proxy cache for an invalid resource, the error is served from the proxy cache.
TCP_MEM_HIT A valid copy of the resource was served out of the cache, and the resource was in the proxy cache's memory (as opposed to having to access the disk to retrieve the cached resource).
TCP_DENIED The request for this resource was denied, probably because the requestor does not have permission to make requests for this resource.
TCP_OFFLINE_HIT The requested resource was retrieved from the cache during its offline mode. Resources are not validated when Squid (or another proxy using this format) is in offline mode.
UDP_* The UDP_* codes indicate that requests were received through the UDP interface to the proxy. HTTP normally uses the TCP transport protocol, so these requests are not using the HTTP protocol.
UDP_HIT A valid copy of the resource was served out of the cache.
UDP_MISS The resource was not in the cache.
UDP_DENIED The request for this resource was denied, probably because the requestor does not have permission to make requests for this resource.
UDP_INVALID The request that the proxy received was invalid.
UDP_MISS_NOFETCH Used by Squid during specific operation modes or in the case of frequent failures. A cache miss was returned and the resource was not fetched.
NONE Logged sometimes with errors.
TCP_CLIENT_REFRESH See TCP_CLIENT_REFRESH_MISS.
TCP_SWAPFAIL See TCP_SWAPFAIL_MISS.
UDP_RELOADING See UDP_MISS_NOFETCH.

Squid has its own protocol for making these requests: ICP. This protocol is used for cache-to-cache requests. See http://www.squid-cache.org for more information.
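One common use of the result codes in Table 21-9 is to estimate how often the cache satisfied HTTP requests. The sketch below groups codes by their suffix and computes a simple hit ratio. The category names and the decision to count every code ending in _HIT as a hit (including TCP_REFRESH_HIT, which still required a revalidation with the server) are assumptions made for illustration; a real report might draw the line differently.

```python
from collections import Counter

# Older aliases from Table 21-9 that stand for *_MISS codes.
MISS_ALIASES = {"TCP_CLIENT_REFRESH", "TCP_SWAPFAIL"}


def classify_result_code(result_code):
    """Place a Squid result code into a coarse category."""
    if result_code.startswith("UDP_"):
        return "icp"  # received over the UDP interface, not HTTP
    if result_code == "TCP_DENIED":
        return "denied"
    if result_code.endswith("_HIT"):
        return "hit"
    if result_code.endswith("_MISS") or result_code in MISS_ALIASES:
        return "miss"
    return "other"  # for example, NONE


def hit_ratio(result_codes):
    """Return hits / (hits + misses) for HTTP requests, or None if there were none."""
    counts = Counter(classify_result_code(code) for code in result_codes)
    total = counts["hit"] + counts["miss"]
    return counts["hit"] / total if total else None


# A hypothetical sequence of result codes taken from successive log entries.
sample_codes = [
    "TCP_HIT", "TCP_MISS", "TCP_MEM_HIT",
    "UDP_HIT", "TCP_DENIED", "TCP_REFRESH_MISS",
]
ratio = hit_ratio(sample_codes)
```

In this sample, the UDP_HIT and TCP_DENIED entries are excluded from the ratio, leaving two hits and two misses.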

 

