Log Formats
Several log formats have become standard, and this section covers the most common ones. Most commercial and open source HTTP applications support logging in one or more of these formats. Many also let administrators configure the log format or define custom formats of their own.
The main benefit of standard formats, both for the applications that support them and for the administrators who use them, is access to the tools already built to process these logs and generate basic statistics from them. Many open source and commercial packages exist to crunch logs for reporting purposes, and by using standard formats, applications and their administrators can plug into these resources.
Common Log Format
One of the most common log formats in use today is called, appropriately, the Common Log Format. It was originally defined by NCSA, and many servers use it as their default log format. Most commercial and open source servers can be configured to use this format, and many commercial and freeware tools exist to help parse common log files. Table 21-1 lists, in order, the fields of the Common Log Format.
Table 21-1. Common Log Format fields
Field | Description |
remotehost | The hostname or IP address of the requestor's machine (IP if the server was not configured to perform reverse DNS or cannot look up the requestor's hostname) |
username | If an ident lookup was performed, the requestor's authenticated username |
auth-username | If authentication was performed, the username with which the requestor authenticated |
timestamp | The date and time of the request |
request-line | The exact text of the HTTP request line; for example, "GET /index.html HTTP/1.1" |
response-code | The HTTP status code that was returned in the response |
response-size | The Content-Length of the response entity; if no entity was returned in the response, a zero is logged |
RFC 931 describes the ident lookup used in this authentication. The ident protocol was discussed in Chapter 5.
Example 21-1 lists a few examples of Common Log Format entries.
Example 21-1. Common Log Format
209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024
http-guide.com - dg [03/Oct/1999:14:16:32 -0400] "GET / HTTP/1.0" 200 477
http-guide.com - dg [03/Oct/1999:14:16:32 -0400] "GET /foo HTTP/1.0" 404 0
In these examples, the fields are assigned as follows:
Field | Entry 1 | Entry 2 | Entry 3 |
remotehost | 209.1.32.44 | http-guide.com | http-guide.com |
username | <empty> | <empty> | <empty> |
auth-username | <empty> | dg | dg |
timestamp | 03/Oct/1999:14:16:00 -0400 | 03/Oct/1999:14:16:32 -0400 | 03/Oct/1999:14:16:32 -0400 |
request-line | GET / HTTP/1.0 | GET / HTTP/1.0 | GET /foo HTTP/1.0 |
response-code | 200 | 200 | 404 |
response-size | 1024 | 477 | 0 |
Note that the remotehost field can be either a hostname, as in http-guide.com, or an IP address, such as 209.1.32.44.
The dashes in the second (username) and third (auth-username) fields indicate that the fields are empty. An empty second field means that no ident lookup occurred; an empty third field means that authentication was not performed.
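The field layout in Table 21-1 maps naturally onto a regular expression. The following is a minimal Python sketch, not a definitive parser: the names `CLF_PATTERN` and `parse_clf` are invented for this illustration, and the pattern assumes that the request line contains no embedded double quotes. It also accepts a dash in the response-size position, which some servers log in place of a zero.

```python
import re
from datetime import datetime

# One Common Log Format line; the group names follow Table 21-1.
# Assumption: the request line contains no embedded double quotes.
CLF_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<username>\S+) (?P<auth_username>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request_line>[^"]*)" '
    r'(?P<response_code>\d{3}) (?P<response_size>\d+|-)'
)

def parse_clf(line):
    """Return a dict of Common Log Format fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # A dash marks an empty username or auth-username field.
    for name in ("username", "auth_username"):
        if fields[name] == "-":
            fields[name] = None
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    fields["response_code"] = int(fields["response_code"])
    size = fields["response_size"]
    fields["response_size"] = 0 if size == "-" else int(size)
    return fields

entry = parse_clf('http-guide.com - dg [03/Oct/1999:14:16:32 -0400] "GET /foo HTTP/1.0" 404 0')
print(entry["remotehost"], entry["auth_username"], entry["response_code"])
# prints: http-guide.com dg 404
```

Returning None for lines that do not match, rather than raising an exception, lets a log-crunching loop skip damaged entries and keep going.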
Combined Log Format
Another commonly used log format is the Combined Log Format. This format is supported by servers such as Apache. The Combined Log Format is very similar to the Common Log Format; in fact, it mirrors it exactly, with the addition of two fields (listed in Table 21-2). The User-Agent field is useful in noting which HTTP client applications are making the logged requests, while the Referer field provides more detail about where the requestor found this URL.
Table 21-2. Additional Combined Log Format fields
Field | Description |
Referer | The contents of the Referer HTTP header |
User-Agent | The contents of the User-Agent HTTP header |
Example 21-2 gives an example of a Combined Log Format entry.
Example 21-2. Combined Log Format
209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024 "http://www.joes-hardware.com/" "5.0: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"
In Example 21-2, the Referer and User-Agent fields are assigned as follows:
Field | Value |
Referer | http://www.joes-hardware.com/ |
User-Agent | 5.0: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98) |
The first seven fields of the example Combined Log Format entry in Example 21-2 are exactly as they would be in the Common Log Format (see the first entry in Example 21-1). The two new fields, Referer and User-Agent, are tacked onto the end of the log entry.
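Because the two new fields are simply appended, a parser can peel them off the end and hand the rest of the line to whatever already understands the Common Log Format. The Python sketch below illustrates the idea; `COMBINED_PATTERN` and `split_combined` are hypothetical names, and the pattern assumes that neither header value contains an embedded double quote.

```python
import re

# A Common Log Format prefix followed by two quoted fields (Table 21-2).
# Assumption: the Referer and User-Agent values contain no embedded double quotes.
COMBINED_PATTERN = re.compile(
    r'(?P<common>\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"$'
)

def split_combined(line):
    """Split a Combined Log Format line into its Common prefix, Referer, and User-Agent."""
    match = COMBINED_PATTERN.match(line.strip())
    if match is None:
        return None
    return match.group("common"), match.group("referer"), match.group("user_agent")

line = ('209.1.32.44 - - [03/Oct/1999:14:16:00 -0400] "GET / HTTP/1.0" 200 1024 '
        '"http://www.joes-hardware.com/" "5.0: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"')
common, referer, user_agent = split_combined(line)
print(referer)     # prints: http://www.joes-hardware.com/
print(user_agent)  # prints: 5.0: Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)
```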
Netscape Extended Log Format
When Netscape entered into the commercial HTTP application space, it defined for its servers many log formats that have been adopted by other HTTP application developers. Netscape's formats derive from the NCSA Common Log Format, but they extend that format to incorporate fields relevant to HTTP applications such as proxies and web caches.
The first seven fields in the Netscape Extended Log Format are identical to those in the Common Log Format (see Table 21-1). Table 21-3 lists, in order, the new fields that the Netscape Extended Log Format introduces.
Table 21-3. Additional Netscape Extended Log Format fields
Field | Description |
proxy-response-code | If the transaction went through a proxy, the HTTP response code from the server to the proxy |
proxy-response-size | If the transaction went through a proxy, the Content-Length of the server's response entity sent to the proxy |
client-request-size | The Content-Length of any body or entity in the client's request to the proxy |
proxy-request-size | If the transaction went through a proxy, the Content-Length of any body or entity in the proxy's request to the server |
client-request-hdr-size | The length, in bytes, of the client's request headers |
proxy-response-hdr-size | If the transaction went through a proxy, the length, in bytes, of the proxy's response headers that were sent to the requestor |
proxy-request-hdr-size | If the transaction went through a proxy, the length, in bytes, of the proxy's request headers that were sent to the server |
server-response-hdr-size | The length, in bytes, of the server's response headers |
proxy-timestamp | If the transaction went through a proxy, the elapsed time for the request and response to travel through the proxy, in seconds |
Example 21-3 gives an example of a Netscape Extended Log Format entry.
Example 21-3. Netscape Extended Log Format
209.1.32.44 - - [03/Oct/1999:14:16:00-0400] "GET / HTTP/1.0" 200 1024 200 1024 0 0 215 260 279 254 3
In this example, the extended fields are assigned as follows:
Field | Value |
proxy-response-code | 200 |
proxy-response-size | 1024 |
client-request-size | 0 |
proxy-request-size | 0 |
client-request-hdr-size | 215 |
proxy-response-hdr-size | 260 |
proxy-request-hdr-size | 279 |
server-response-hdr-size | 254 |
proxy-timestamp | 3 |
The first seven fields of the example Netscape Extended Log Format entry in Example 21-3 mirror the entries in the Common Log Format example (see the first entry in Example 21-1).
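Since the nine extended fields are unlabeled numbers, it is easy to lose track of which is which. The Python sketch below pairs each value with its name from Table 21-3. It is an illustration under stated assumptions: the names `EXTENDED_FIELDS` and `parse_extended_tail` are invented here, the request line is taken to be the last quoted field on the line, and every extended field is assumed to hold an integer.

```python
# Field names, in order, from Table 21-3.
EXTENDED_FIELDS = (
    "proxy-response-code", "proxy-response-size", "client-request-size",
    "proxy-request-size", "client-request-hdr-size", "proxy-response-hdr-size",
    "proxy-request-hdr-size", "server-response-hdr-size", "proxy-timestamp",
)

def parse_extended_tail(line):
    """Map the nine trailing fields of a Netscape Extended entry to their names."""
    # Everything after the final double quote is response-code, response-size,
    # and then the nine extended fields.
    tail = line.rpartition('"')[2].split()
    expected = 2 + len(EXTENDED_FIELDS)
    if len(tail) != expected:
        raise ValueError("expected %d fields after the request line, got %d"
                         % (expected, len(tail)))
    return dict(zip(EXTENDED_FIELDS, (int(value) for value in tail[2:])))

line = ('209.1.32.44 - - [03/Oct/1999:14:16:00-0400] "GET / HTTP/1.0" '
        '200 1024 200 1024 0 0 215 260 279 254 3')
fields = parse_extended_tail(line)
print(fields["client-request-hdr-size"], fields["proxy-timestamp"])
# prints: 215 3
```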
Netscape Extended 2 Log Format
Another Netscape log format, the Netscape Extended 2 Log Format, takes the Extended Log Format and adds further information relevant to HTTP proxy and web caching applications. These extra fields help paint a better picture of the interactions between an HTTP client and an HTTP proxy application.
The Netscape Extended 2 Log Format derives from the Netscape Extended Log Format: its initial fields are the Common Log Format fields listed in Table 21-1, followed by the extended fields listed in Table 21-3.
Table 21-4 lists, in order, the additional fields of the Netscape Extended 2 Log Format.
Table 21-4. Additional Netscape Extended 2 Log Format fields
Field | Description |
route | The route that the proxy used to make the request for the client (see Table 21-5) |
client-finish-status-code | The client finish status code; specifies whether the client request to the proxy completed successfully (FIN) or was interrupted (INTR) |
proxy-finish-status-code | The proxy finish status code; specifies whether the proxy request to the server completed successfully (FIN) or was interrupted (INTR) |
cache-result-code | The cache result code; tells how the cache responded to the request |
Table 21-7 lists the Netscape cache result codes.
Example 21-4 gives an example of a Netscape Extended 2 Log Format entry.
Example 21-4. Netscape Extended 2 Log Format
209.1.32.44 - - [03/Oct/1999:14:16:00-0400] "GET / HTTP/1.0" 200 1024 200 1024 0 0 215 260 279 254 3 DIRECT FIN FIN WRITTEN
The extended fields in this example are assigned as follows:
Field | Value |
route | DIRECT |
client-finish-status-code | FIN |
proxy-finish-status-code | FIN |
cache-result-code | WRITTEN |
The first 16 fields in the Netscape Extended 2 Log Format entry in Example 21-4 mirror the entries in the Netscape Extended Log Format example (see Example 21-3).
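The four Extended 2 fields always sit at the end of the entry, so they can be read off without parsing the rest of the line. The following Python sketch does that and also breaks a route value into its parts. The names `ROUTE_PATTERN` and `parse_extended2_tail` are hypothetical, and the route syntax it accepts is an assumption based only on the three forms DIRECT, PROXY(host:port), and SOCKS(socks:port).

```python
import re

# Assumed route syntax: DIRECT, PROXY(host:port), or SOCKS(socks:port).
ROUTE_PATTERN = re.compile(
    r'^(DIRECT|PROXY|SOCKS)(?:\((?P<host>[^:()]+):(?P<port>\d+)\))?$'
)

def parse_extended2_tail(line):
    """Return the four Extended 2 fields found at the end of a log line."""
    route, client_finish, proxy_finish, cache_result = line.split()[-4:]
    match = ROUTE_PATTERN.match(route)
    if match is None:
        raise ValueError("unrecognized route code: %r" % route)
    return {
        "route": match.group(1),
        "route-host": match.group("host"),
        "route-port": int(match.group("port")) if match.group("port") else None,
        "client-finish-status-code": client_finish,
        "proxy-finish-status-code": proxy_finish,
        "cache-result-code": cache_result,
    }

line = ('209.1.32.44 - - [03/Oct/1999:14:16:00-0400] "GET / HTTP/1.0" '
        '200 1024 200 1024 0 0 215 260 279 254 3 DIRECT FIN FIN WRITTEN')
tail = parse_extended2_tail(line)
print(tail["route"], tail["cache-result-code"])
# prints: DIRECT WRITTEN
```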
Table 21-5 lists the valid Netscape route codes.
Table 21-5. Netscape route codes
Value | Description |
DIRECT | The resource was fetched directly from the server. |
PROXY(host:port) | The resource was fetched through the proxy "host." |
SOCKS(socks:port) | The resource was fetched through the SOCKS server "socks." |
Table 21-6 lists the valid Netscape finish codes.
Table 21-6. Netscape finish status codes
Value | Description |
- | The request never even started. |
FIN | The request was completed successfully. |
INTR | The request was interrupted by the client or ended by a proxy/server. |
TIMEOUT | The request was timed out by the proxy/server. |
Table 21-7 lists the valid Netscape cache codes.
Chapter 7 discusses HTTP caching in detail.
Table 21-7. Netscape cache codes
Code | Description |
- | The resource was uncacheable. |
WRITTEN | The resource was written into the cache. |
REFRESHED | The resource was cached and it was refreshed. |
NO-CHECK | The cached resource was returned; no freshness check was done. |
UP-TO-DATE | The cached resource was returned; a freshness check was done. |
HOST-NOT-AVAILABLE | The cached resource was returned; no freshness check was done because the remote server was not available. |
CL-MISMATCH | The resource was not written to the cache; the write was aborted because the Content-Length did not match the resource size. |
ERROR | The resource was not written to the cache due to some error; for example, a timeout occurred or the client aborted the transaction. |
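The cache codes in Table 21-7 lend themselves to simple reporting. The Python sketch below counts the cache-result code at the end of each Extended 2 entry and computes the share of requests answered from the cache. It is only an illustration: `summarize_cache_results` is a hypothetical helper, the sample lines are abbreviated stand-ins for real entries, and the choice to count only the three codes described as "The cached resource was returned" is an assumption (REFRESHED, for one, could reasonably be counted differently).

```python
from collections import Counter

# Codes whose Table 21-7 description begins "The cached resource was returned".
RETURNED_FROM_CACHE = {"NO-CHECK", "UP-TO-DATE", "HOST-NOT-AVAILABLE"}

def summarize_cache_results(lines):
    """Count cache-result codes (the last field of each Extended 2 entry)."""
    counts = Counter(line.split()[-1] for line in lines if line.strip())
    total = sum(counts.values())
    returned = sum(counts[code] for code in RETURNED_FROM_CACHE)
    ratio = returned / total if total else 0.0
    return counts, ratio

# Abbreviated, made-up entries; only the trailing fields matter here.
sample = [
    "... DIRECT FIN FIN WRITTEN",
    "... DIRECT FIN FIN UP-TO-DATE",
    "... DIRECT FIN FIN NO-CHECK",
    "... DIRECT INTR FIN -",
]
counts, ratio = summarize_cache_results(sample)
print(counts["WRITTEN"], ratio)
# prints: 1 0.5
```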
Netscape applications, like many other HTTP applications, have other log formats too, including a Flexible Log Format and a means for administrators to output custom log fields. These formats allow administrators greater control and the ability to customize their logs by choosing which parts of the HTTP transaction (headers, status, sizes, etc.) to report in their logs.
The ability for administrators to configure custom formats was added because it is difficult to predict what information administrators will be interested in getting from their logs. Many other proxies and servers also have the ability to emit custom logs.
Squid Proxy Log Format
The Squid proxy cache (http://www.squid-cache.org) is a venerable part of the Web. Its roots trace back to one of the early web proxy cache projects (ftp://ftp.cs.colorado.edu/pub/techreports/schwartz/Harvest.Conf.ps.Z). Squid is an open source project that has been extended and enhanced by the open source community over the years. Many tools have been written to help administer the Squid application, including tools to help process, audit, and mine its logs. Many subsequent proxy caches adopted the Squid format for their own logs so that they could leverage these tools.
The format of a Squid log entry is fairly simple. Its fields are summarized in Table 21-8.
Table 21-8. Squid Log Format fields
Field | Description |
timestamp | The timestamp when the request arrived, in seconds since January 1, 1970 GMT. |
time-elapsed | The elapsed time for request and response to travel through the proxy, in milliseconds. |
host-ip | The IP address of the client's (requestor's) host machine. |
result-code/status | The result-code field is a Squid-ism that tells what action the proxy took during this request; the status field is the HTTP response code that the proxy sent to the client. |
size | The length of the proxy's response to the client, including HTTP response headers and body, in bytes. |
method | The HTTP method of the client's request. |
url | The URL in the client's request. |
rfc931-ident | The client's authenticated username. |
hierarchy/from | Like the route field in Netscape formats, the hierarchy field tells what route the proxy used to make the request for the client. The from field tells the name of the server that the proxy used to make the request. |
content-type | The Content-Type of the proxy response entity. |
Table 21-9 lists the various result codes and their meanings.
Recall from Chapter 2 that proxies often log the entire requested URL, so if a username and password component are in the URL, a proxy can inadvertently record this information.
The rfc931-ident, hierarchy/from, and content-type fields were added in Squid 1.1. Previous versions did not have these fields.
RFC 931 describes the ident lookup used in this authentication.
http://squid.nlanr.net/Doc/FAQ/FAQ-6.html#ss6.6 lists all of the valid Squid hierarchy codes.
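One way to avoid recording the usernames and passwords mentioned in the note above is to strip that component from each URL before it is written to the log. The following Python sketch uses the standard `urllib.parse` module; the function name `strip_userinfo` is hypothetical, and nothing here describes what Squid or any other proxy actually does.

```python
from urllib.parse import urlsplit, urlunsplit

def strip_userinfo(url):
    """Return the URL with any username and password component removed."""
    parts = urlsplit(url)
    # The userinfo, if present, is everything before the last "@" in the
    # network-location part; keep only the host and optional port after it.
    netloc = parts.netloc.rpartition("@")[2]
    return urlunsplit(parts._replace(netloc=netloc))

print(strip_userinfo("http://joe:secret@www.joes-hardware.com/tools.html"))
# prints: http://www.joes-hardware.com/tools.html
```

URLs without a username or password pass through unchanged, so the function can be applied to every logged URL.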
Example 21-5 gives an example of a Squid Log Format entry.
Example 21-5. Squid Log Format
99823414 3001 209.1.32.44 TCP_MISS/200 4087 GET http://www.joes-hardware.com - DIRECT/proxy.com text/html
The fields are assigned as follows:
Field | Value |
timestamp | 99823414 |
time-elapsed | 3001 |
host-ip | 209.1.32.44 |
result-code | TCP_MISS |
status | 200 |
size | 4087 |
method | GET |
url | http://www.joes-hardware.com |
rfc931-ident | - |
hierarchy | DIRECT |
from | proxy.com |
content-type | text/html |
The DIRECT Squid hierarchy value is the same as the DIRECT route value in Netscape log formats.
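A Squid entry is whitespace-delimited, with two of its fields carrying a slash-separated pair, so it can be taken apart with plain string operations. The Python sketch below follows the field order of Table 21-8. The names `SQUID_FIELDS` and `parse_squid` are invented for this illustration, and the sketch assumes that all ten fields are present, which would not hold for the older logs that lack the last three fields.

```python
from datetime import datetime, timezone

# Field names, in order, from Table 21-8.
SQUID_FIELDS = ("timestamp", "time-elapsed", "host-ip", "result-code/status", "size",
                "method", "url", "rfc931-ident", "hierarchy/from", "content-type")

def parse_squid(line):
    """Split one Squid log line into named fields."""
    values = line.split()
    if len(values) != len(SQUID_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(SQUID_FIELDS), len(values)))
    entry = dict(zip(SQUID_FIELDS, values))
    # The two slash-separated fields each carry a pair of values.
    entry["result-code"], status = entry.pop("result-code/status").split("/", 1)
    entry["status"] = int(status)
    entry["hierarchy"], entry["from"] = entry.pop("hierarchy/from").split("/", 1)
    # The timestamp is seconds since January 1, 1970 GMT; elapsed time is milliseconds.
    entry["timestamp"] = datetime.fromtimestamp(float(entry["timestamp"]), tz=timezone.utc)
    entry["time-elapsed"] = int(entry["time-elapsed"])
    entry["size"] = int(entry["size"])
    return entry

line = ("99823414 3001 209.1.32.44 TCP_MISS/200 4087 GET "
        "http://www.joes-hardware.com - DIRECT/proxy.com text/html")
entry = parse_squid(line)
print(entry["result-code"], entry["status"], entry["from"])
# prints: TCP_MISS 200 proxy.com
```

The timestamp is converted with `float` rather than `int` so that values with a fractional part are also accepted.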
Table 21-9 lists the various Squid result codes.
Several of these action codes deal more with the internals of the Squid proxy cache, so not all of them are used by other proxies that implement the Squid Log Format.
Table 21-9. Squid result codes
Action | Description |
TCP_HIT | A valid copy of the resource was served out of the cache. |
TCP_MISS | The resource was not in the cache. |
TCP_REFRESH_HIT | The resource was in the cache but needed to be checked for freshness. The proxy revalidated the resource with the server and found that the in-cache copy was indeed still fresh. |
TCP_REF_FAIL_HIT | The resource was in the cache but needed to be checked for freshness. However, the revalidation failed (perhaps the proxy could not connect to the server), so the "stale" resource was returned. |
TCP_REFRESH_MISS | The resource was in the cache but needed to be checked for freshness. Upon checking with the server, the proxy learned that the resource in the cache was out of date and received a new version. |
TCP_CLIENT_REFRESH_MISS | The requestor sent a Pragma: no-cache or similar Cache-Control directive, so the proxy was forced to fetch the resource. |
TCP_IMS_HIT | The requestor issued a conditional request, which was validated against the cached copy of the resource. |
TCP_SWAPFAIL_MISS | The proxy thought the resource was in the cache but for some reason could not access it. |
TCP_NEGATIVE_HIT | A cached response was returned, but the response was a negatively cached response. Squid supports the notion of caching errors for resources (for example, caching a 404 Not Found response), so if multiple requests go through the proxy cache for an invalid resource, the error is served from the proxy cache. |
TCP_MEM_HIT | A valid copy of the resource was served out of the cache, and the resource was in the proxy cache's memory (as opposed to having to access the disk to retrieve the cached resource). |
TCP_DENIED | The request for this resource was denied, probably because the requestor does not have permission to make requests for this resource. |
TCP_OFFLINE_HIT | The requested resource was retrieved from the cache during its offline mode. Resources are not validated when Squid (or another proxy using this format) is in offline mode. |
UDP_* | The UDP_* codes indicate that requests were received through the UDP interface to the proxy. HTTP normally uses the TCP transport protocol, so these requests are not using the HTTP protocol. |
UDP_HIT | A valid copy of the resource was served out of the cache. |
UDP_MISS | The resource was not in the cache. |
UDP_DENIED | The request for this resource was denied, probably because the requestor does not have permission to make requests for this resource. |
UDP_INVALID | The request that the proxy received was invalid. |
UDP_MISS_NOFETCH | Used by Squid during specific operation modes or in the case of frequent failures. A cache miss was returned and the resource was not fetched. |
NONE | Logged sometimes with errors. |
TCP_CLIENT_REFRESH | See TCP_CLIENT_REFRESH_MISS. |
TCP_SWAPFAIL | See TCP_SWAPFAIL_MISS. |
UDP_RELOADING | See UDP_MISS_NOFETCH. |
Squid has its own protocol for making these requests: ICP. This protocol is used for cache-to-cache requests. See http://www.squid-cache.org for more information.
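The result codes in Table 21-9 are what log-mining tools typically summarize. As a rough Python sketch of that kind of processing, the code below buckets each result code by its name and tallies the buckets. The names `classify_result` and `tally_results` are hypothetical, and grouping codes purely by the words HIT, MISS, and DENIED in their names is a simplifying assumption rather than an official classification.

```python
from collections import Counter

# Short forms that Table 21-9 cross-references to a *_MISS code.
MISS_ALIASES = ("TCP_CLIENT_REFRESH", "TCP_SWAPFAIL", "UDP_RELOADING")

def classify_result(result_code):
    """Bucket a Squid result code by the wording of its name."""
    if result_code.endswith("HIT"):
        return "hit"
    if "MISS" in result_code or result_code in MISS_ALIASES:
        return "miss"
    if result_code.endswith("DENIED"):
        return "denied"
    return "other"

def tally_results(lines):
    """Count result categories; the result code precedes '/' in the fourth field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        counts[classify_result(fields[3].split("/", 1)[0])] += 1
    return counts

# Made-up entries, truncated after the result-code/status field.
sample = [
    "99823414 3001 209.1.32.44 TCP_MISS/200",
    "99823415 12 209.1.32.44 TCP_HIT/200",
    "99823416 15 209.1.32.44 TCP_MEM_HIT/200",
    "99823417 2 209.1.32.44 TCP_DENIED/403",
]
counts = tally_results(sample)
print(counts["hit"], counts["miss"], counts["denied"])
# prints: 2 1 1
```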