Reliability-aware Checkpoint/Restart Scheme: A Performability Trade-off

Yudan Liu, Chokchai "Box" Leangsuksun

IEEE International Conference on Cluster Computing (Cluster 2005)
Boston, Massachusetts, USA, September 27 - 30, 2005


Abstract

In recent years, large-scale clusters have been commonly deployed to solve important grand-challenge scientific problems. Cluster sizes have been increased in order to reduce computation time. Unfortunately, the reliability of such cluster systems moves in the opposite direction as the system scale grows. Since the failure of a single node can result in a total service or application outage, it is essential to deal effectively with faults in a grand-challenge problem-solving environment. Checkpointing is one of the common fault-tolerance techniques. However, it poses many challenges, such as overhead, latency, and consistency, as well as recovery. In this paper, we introduce a reliability-aware optimal checkpoint/restart method, a novel technique that places checkpoints based on system reliability. We construct a cost model and derive an optimal checkpoint interval as a function of failure rates: a trade-off between performance and reliability (i.e., performability). We implemented a proof of concept and demonstrated the improvements resulting from our technique for fault-tolerant MPI applications on an HA-OSCAR cluster.
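
To make the trade-off described in the abstract concrete, the following minimal sketch may help. It is not the cost model of this paper, which the abstract does not spell out. It assumes independent, exponentially distributed node failures, so that the system MTBF shrinks in proportion to the node count, and it swaps in Young's well-known first-order approximation for the checkpoint interval, sqrt(2 * C * MTBF). The function names and the numeric values (node MTBF, checkpoint cost) are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """MTBF of a cluster in which any single node failure stops the job.

    Assumption: node failures are independent and exponentially
    distributed, so the failure rates of the nodes simply add up.
    """
    return node_mtbf_hours / nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    This stands in for the paper's own model, which is not given in
    the abstract.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical values: 10,000 h node MTBF, 0.1 h (6 min) per checkpoint.
    for nodes in (16, 256, 4096):
        mtbf = system_mtbf(node_mtbf_hours=10_000.0, nodes=nodes)
        interval = young_interval(checkpoint_cost_hours=0.1, mtbf_hours=mtbf)
        print(f"{nodes:5d} nodes: MTBF {mtbf:8.2f} h, "
              f"checkpoint every {interval:5.2f} h")
```

Under these assumptions, a larger cluster has a shorter system MTBF and therefore calls for more frequent checkpoints, which is the tension between performance and reliability that the abstract refers to as performability.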


  