Optimizing the distribution of large data sets in theory and practice

Felix Rauch, Christian Kurmann and Thomas M. Stricker

Laboratory for Computer Systems, ETH - Swiss Institute of Technology, CH-8092 Zurich, Switzerland
http://www.cs.inf.ethz.ch/

SUMMARY

Multicasting large amounts of data efficiently to all nodes of a PC cluster is an important operation. In the form of a partition cast, it can be used to replicate entire software installations by cloning. Optimizing a partition cast for a given cluster of PCs reveals some interesting architectural tradeoffs, since the fastest solution depends not only on the network speed and topology, but also remains highly sensitive to other resources such as the disk speed, the memory system performance and the processing power of the participating nodes. We present an analytical model that guides an implementation towards an optimal configuration for any given PC cluster. The model is validated by measurements on our cluster using Gigabit- and Fast-Ethernet links. The resulting simple software tool, Dolly, can replicate an entire 2 GB Windows NT image onto 24 machines in less than 5 min.

Copyright (c) 2002 John Wiley & Sons, Ltd.

KEY WORDS: software installation and maintenance; data streaming; partition management; communication modelling; multicast; input output systems

Published in: CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE
Concurrency Computat.: Pract. Exper. 2002; 14:165-181 (DOI: 10.1002/cpe.603)