Fault Tolerance

The business community now expects a Web service to be not only available and scalable but also highly fault tolerant. As Web services evolve, those built using fault tolerant technology will be called on to support highly complex services. Many services will be expected to handle thousands of clients or more concurrently. Reliability and performance become crucial to the success and brand of the deployed Web services.

Many app server vendors propose to handle this problem by clustering. Each vendor has its own scheme, including DNS aliasing, TCP connection routing, HTTP redirection, and other approaches. While these schemes increase scalability and performance, they do not address fault tolerance for services. Clusters can detect when one server fails and transparently replace the affected server with a redundant one. This increases availability, but any requests in progress on the failed server are lost. A fault tolerant server, by contrast, migrates the requests on the failed server to a redundant server and starts recovery. The clustering capabilities of many app servers also do not appropriately handle large, unpredictable bursts of requests. Even more important, in high-traffic Web services, requests for low-profit services can overwhelm the requests for high-profit services or services for VIP customers. Services, especially highly profitable ones, should remain available even when the servers are heavily loaded, yet there is no standard way of giving these services higher priority than low-profit requests. To address these two critical issues, a strategy must be developed that supports request determination and request migration.
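The prioritization problem can be illustrated with a small sketch. It is only one possible approach, not a standard mechanism: the class name `PriorityRequestQueue`, the method names, and the convention that a lower number means a more important request are all assumptions made for illustration.

```python
import heapq
import itertools


class PriorityRequestQueue:
    """Serve more important requests first; keep arrival order within a level."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker so that requests with the
        # same priority are served in the order they arrived.
        self._arrival = itertools.count()

    def submit(self, request_id, priority):
        # Assumed convention: 0 = VIP or high-profit, larger = less important.
        heapq.heappush(self._heap, (priority, next(self._arrival), request_id))

    def next_request(self):
        # Return the most important pending request id, or None when idle.
        if not self._heap:
            return None
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self):
        return len(self._heap)
```

Under heavy load, a server that pulls work from such a queue drains high-profit requests before low-profit ones instead of serving them strictly in arrival order.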

Request Determination

A fault tolerant Web service must support two capabilities: checkpointing and recovery. Checkpointing takes a snapshot of each client request in its intermediate state and writes it to a log. When a Web service fails, the recovery mechanism can read the log of outstanding requests on the failed service and continue processing them on an active service.
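A minimal sketch of checkpointing and recovery follows. It assumes an append-only log file with one JSON record per line, that every checkpoint carries all the state needed to resume the request, and that a request is finished once a record with state "done" is written. The names `RequestLog`, `checkpoint`, `outstanding`, and `recover` are hypothetical.

```python
import json
import os


class RequestLog:
    """Append-only log of request checkpoints, one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def checkpoint(self, request_id, state, payload=None):
        record = {"id": request_id, "state": state, "payload": payload}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # push the record to disk before continuing

    def outstanding(self):
        """Return the latest record of every request not yet marked 'done'."""
        latest = {}
        if not os.path.exists(self.path):
            return latest
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                latest[record["id"]] = record  # later records win
        return {rid: rec for rid, rec in latest.items() if rec["state"] != "done"}


def recover(log, handler):
    """Resume every outstanding request by passing its record to `handler`."""
    resumed = []
    for request_id, record in log.outstanding().items():
        handler(record)  # continue processing on an active service
        log.checkpoint(request_id, "done")
        resumed.append(request_id)
    return resumed
```

In this sketch the surviving service would call `recover` with a handler that picks up each request from its last recorded state. A production log would also need rotation and protection against a partially written final line, which are omitted here.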

Capturing the state of all requests could become expensive in terms of performance and therefore should be reserved for mission-critical requests. At minimum, transactional requests should take precedence over inquiry requests when deciding what to log. Logging and recovery should be transparent to the requestor. The best way to accomplish this is by creating a dispatcher Web service, which takes requests on behalf of other services and dispatches them to the service best suited to handle them. Before sending the request, the dispatcher stores related information about the request and the selected target service in an internal map. The dispatcher also makes routing decisions and could either statically store all service endpoints or maintain its own routing table of services discovered via UDDI.
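The dispatcher described above might be sketched as follows. This is an illustration under stated assumptions: endpoints are modeled as plain Python callables rather than remote services, the routing table is static (UDDI discovery is not modeled), and "best suited" is reduced to simple round-robin selection. The class and attribute names are hypothetical.

```python
import itertools


class Dispatcher:
    """Accept requests on behalf of other services and route them."""

    def __init__(self, routing_table):
        # routing_table maps a service name to a list of endpoint callables.
        # It is static here; a real dispatcher could refresh it from a registry.
        self.routing_table = routing_table
        # Internal map: request id -> (service name, endpoint index, payload).
        self.in_flight = {}
        self._ids = itertools.count(1)
        self._next = {name: 0 for name in routing_table}

    def dispatch(self, service, payload):
        endpoints = self.routing_table[service]
        # Round-robin stands in for a real "best suited" routing decision.
        index = self._next[service] % len(endpoints)
        self._next[service] += 1

        request_id = next(self._ids)
        # Record the request and its target BEFORE sending it, so the entry
        # survives if the target fails while processing.
        self.in_flight[request_id] = (service, index, payload)

        result = endpoints[index](payload)  # may raise if the target fails

        # Only a completed request is removed from the internal map.
        del self.in_flight[request_id]
        return result
```

If an endpoint raises, the entry stays in `in_flight`, which gives a recovery step the information it needs to decide where to send the request next.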

Request Migration

A Web service requires a mechanism that identifies service overload or failure. If the target Web service handles inquiries, the dispatcher can send the request to another service for fulfillment. If the target Web service is transactional, the dispatcher may need to participate in a distributed transaction, store the resulting state, and return the results to the client. The client is typically not informed whether a request has committed successfully or of the reason for a failure. Sometimes clients may have to wait for a timeout to occur. Clients that implement retry logic may be charged for submitting the request multiple times, which should be avoided. Migrating the request to another service requires intimate knowledge of service-specific details, such as the internal state, intermediate parameters, and so on. The intermediate processing state must be replicated to ensure fault tolerance of the service itself.
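The two cases can be sketched separately. For inquiry requests, the dispatcher simply tries another replica. For transactional requests, the sketch below uses duplicate detection by request id, a common technique that the text does not prescribe, so that a retried request is applied only once; it does not model a distributed transaction or state replication. All names (`ServiceUnavailable`, `migrate_inquiry`, `TransactionalService`, `charge`) are hypothetical.

```python
class ServiceUnavailable(Exception):
    """Raised by an endpoint that has failed or is overloaded."""


def migrate_inquiry(endpoints, payload):
    """Try each replica in turn; inquiry requests are safe to send again."""
    last_error = None
    for endpoint in endpoints:
        try:
            return endpoint(payload)
        except ServiceUnavailable as error:
            last_error = error  # remember the failure and try the next replica
    raise ServiceUnavailable("all replicas failed") from last_error


class TransactionalService:
    """Remembers results by request id so a retried request is applied once."""

    def __init__(self):
        self.completed = {}  # request id -> result already returned
        self.balance = 0

    def charge(self, request_id, amount):
        # A retry of a request that already committed returns the stored
        # result instead of charging the client a second time.
        if request_id in self.completed:
            return self.completed[request_id]
        self.balance += amount
        self.completed[request_id] = self.balance
        return self.balance
```

With this arrangement a client that times out and resubmits the same request id sees the original result, which addresses the double-charging concern for retries that reach the same service instance. Making it hold across replicas would require the `completed` map to be replicated as well.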

Fault tolerance ensures that no request will be lost because of server or service failure or overload, but it has a monetary cost and a potential performance cost as well. This approach requires care with implementation details.