
Transparent Networked Checkpoint-Restart for Commodity Clusters

Oren Laadan, Dan Phung, Jason Nieh

IEEE International Conference on Cluster Computing (Cluster 2005)
Boston, Massachusetts, USA, September 27 - 30, 2005


Abstract

We have created a novel system for transparent, coordinated checkpoint-restart of distributed network applications on commodity clusters. Our system provides a thin virtualization layer on top of the operating system that decouples a distributed application from dependencies on the cluster nodes on which it is executing. Unlike other approaches, our virtualization enables parallel applications to efficiently utilize multiprocessor systems. It also enables us to checkpoint the entire distributed application across all nodes in a coordinated manner, so that it can be restarted from the checkpoint on a different set of cluster nodes at a later time. The checkpoint-restart operations execute in parallel across the cluster nodes, which speeds up both checkpoint and restart. We uniquely support network state in a transport-protocol-independent manner, including correctly saving and restoring socket and protocol state for both TCP and UDP connections. We have implemented a Linux prototype of our system and demonstrate that it provides low virtualization overhead and fast checkpoint-restart times for distributed network applications without any application, library, kernel, or network protocol modifications.
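
As a rough illustration of what "coordinated" and "in parallel" can mean here, the following is a minimal Python sketch, not the system described in the paper. It assumes a generic barrier-based scheme: every node stops, all nodes save their local state concurrently, and no node resumes until all have saved. The names Node, coordinated_checkpoint, and restart are hypothetical, and threads stand in for cluster nodes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Node:
    """Hypothetical stand-in for one cluster node and its local state."""

    def __init__(self, name, state):
        self.name = name
        self.state = dict(state)
        self.frozen = False

    def checkpoint(self, barrier):
        self.frozen = True        # stop local activity on this node
        barrier.wait()            # proceed only once every node is frozen
        image = dict(self.state)  # save this node's state (copy, not alias)
        barrier.wait()            # proceed only once every node has saved
        self.frozen = False       # resume local activity
        return self.name, image


def coordinated_checkpoint(nodes):
    """Checkpoint all nodes concurrently; returns {node name: saved image}."""
    barrier = threading.Barrier(len(nodes))
    # One worker per node, so every node can reach the barrier at once.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(node.checkpoint, barrier) for node in nodes]
        return dict(future.result() for future in futures)


def restart(images, new_names):
    """Recreate the saved images on a different set of (hypothetical) nodes."""
    saved = sorted(images.items())
    return [Node(new, image) for new, (_, image) in zip(new_names, saved)]
```

For example, checkpointing two nodes named "a" and "b" and calling restart(images, ["c", "d"]) yields two new nodes that carry the saved state under different names. Real network state (sockets, in-flight data) is deliberately left out of this sketch.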


  