
Job-Site Level Fault Tolerance for Cluster and Grid environments

Kshitij Limaye, Box Leangsuksun, Zeno Greenwood, Stephen L. Scott, Christian Engelmann, Richard Libby and Kasidit Chanchio

IEEE International Conference on Cluster Computing (Cluster 2005)
Boston, Massachusetts, USA, September 27 - 30, 2005


Abstract

Fault tolerance is a necessity if high-performance clusters and grid computing are to be adopted for mission-critical applications. In distributed systems, fault tolerance is commonly achieved through checkpoint-recovery and, in the event of a system outage, job replication on alternative resources. The latter approach depends on the availability of alternative sites to run replicas, and it leaves the crashed site awaiting recovery, which reduces the total available resources. These approaches need to be complemented by proactive failure handling at the job-site level, which ensures high system availability with no loss of user-submitted jobs. This paper discusses a novel fault tolerance technique that enables job-site recovery in cluster-based grid environments, whereas existing techniques give up on a failed system and seek alternative resources. Results from an implementation of our method in Globus-enabled HA-OSCAR suggest a sizable aggregate performance improvement. The technique, called “Smart Failover”, provides a transparent and graceful recovery mechanism that saves job states in a local job-manager queue and transfers those states to the backup server periodically and upon critical system events. Thus, whenever a failover occurs, the backup server is able to restart the jobs from their last saved state. We also present a variant of the smart failover feature that uses a modified Globus job-submission client to provide resilience when a remote site's head node fails.
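
The state-transfer idea described in the abstract can be illustrated with a minimal sketch. This is not the HA-OSCAR or Globus implementation; the class and method names (`JobState`, `PrimaryServer`, `BackupServer`, `sync`, `take_over`) are hypothetical and chosen only for illustration. It assumes the primary head node keeps a job queue, copies it to the backup on each synchronization (periodic or event-triggered), and that the backup requeues jobs from the last copy it received.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class JobState:
    # Immutable record of one job's last known state (hypothetical format).
    job_id: str
    status: str


@dataclass
class BackupServer:
    # Last snapshot of job states received from the primary.
    saved: Dict[str, JobState] = field(default_factory=dict)

    def receive(self, snapshot: Dict[str, JobState]) -> None:
        # Store a copy so later changes on the primary do not leak in.
        self.saved = dict(snapshot)

    def take_over(self) -> List[str]:
        # On failover, return the jobs to restart from their saved state.
        return sorted(self.saved)


@dataclass
class PrimaryServer:
    backup: BackupServer
    queue: Dict[str, JobState] = field(default_factory=dict)

    def submit(self, job_id: str) -> None:
        self.queue[job_id] = JobState(job_id, "queued")

    def sync(self) -> None:
        # Assumed to be called periodically and on critical system events.
        self.backup.receive(self.queue)
```

In this sketch, a job submitted after the most recent `sync` call is not known to the backup, which is why the abstract pairs periodic transfer with transfer on critical system events.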


  