A comparison of Message-Passing and Distributed Shared Memory under Windows NT

Roman Roth

Diploma Thesis Winter 1998/99
Supervisors: Christian Kurmann, Prof. T. Stricker
Institute for Computer Systems, ETH Zürich

Objectives and results

Ever faster PCs, workstations, and networks are widening the range of applications for PC clusters. Often, however, software overhead keeps applications from exploiting the full performance of the communication hardware. As various studies have shown, memory copies inside conventional communication models are responsible for this unsatisfactory performance. Zero-copy communication is therefore required: data is sent from the memory of the sender directly into the memory of the recipient.

Zero-copy communication, however, is only possible if both the hardware and the operating system actively support it. The hardware must provide techniques such as Direct Memory Access (DMA) or device memory, and the operating system must be able to use these techniques efficiently. An investigation of the kernel structures of Windows NT 4.0 showed that this operating system offers all the functionality needed for an efficient implementation of zero-copy communication.

This thesis was intended to be based on a Myrinet network. For the use of Myrinet under Windows NT 4.0, Myricom developed the GM-API. To avoid the complexity of writing kernel-mode drivers, this API was to be used for communication over Myrinet. Unfortunately, no executable version of GM was available during the course of this work, so zero-copy communication was first analyzed in theory. The GM-API sends and receives data in a zero-copy manner, so transmission itself is not the real problem. The more interesting question is where the transmitted data should be received. Basically, there are two different types of communication:

- The recipient is responsible for providing sufficient memory for receiving, so the sender can transmit at any time. Because memory is a limited resource, this type of communication is suitable only for smaller packets.

- The recipient allocates memory for each packet and signals its ready-to-receive state to the sender. The sender may send only after receiving this indication. Because this type of communication is synchronized, it is worthwhile only for larger packets.
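The two types above can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: the class `Receiver` and all of its method names are hypothetical and do not come from the thesis or from the GM-API, and the slice assignment merely stands in for the direct write into the recipient's memory that real zero-copy hardware would perform.

```python
from collections import deque


class Receiver:
    """Toy receiver showing both receive strategies (all names are hypothetical)."""

    def __init__(self, eager_slots, slot_size):
        # Type 1: buffers are provided in advance, so a sender may transmit at any time.
        self.slot_size = slot_size
        self.free_slots = deque(bytearray(slot_size) for _ in range(eager_slots))
        self.delivered = []   # (buffer, length) pairs that have been filled
        # Type 2: buffers announced as ready, keyed by a tag chosen by the receiver.
        self.posted = {}

    def eager_deliver(self, payload):
        """Type 1: no handshake; fails when the packet is too big or memory has run out."""
        if len(payload) > self.slot_size or not self.free_slots:
            return False
        buf = self.free_slots.popleft()
        buf[:len(payload)] = payload   # stands in for the write into receiver memory
        self.delivered.append((buf, len(payload)))
        return True

    def post_receive(self, tag, size):
        """Type 2, step 1: allocate memory and announce readiness for one packet."""
        buf = bytearray(size)
        self.posted[tag] = buf
        return buf

    def rendezvous_deliver(self, tag, payload):
        """Type 2, step 2: the sender may only deliver after the announcement."""
        buf = self.posted.get(tag)
        if buf is None or len(payload) > len(buf):
            return False
        del self.posted[tag]
        buf[:len(payload)] = payload
        return True
```

The sketch makes the trade-off visible: the first path needs no round trip but is bounded by the memory set aside in advance, while the second path can accept packets of any size at the cost of one synchronization per packet.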

Both types of communication are implemented in a so-called Zero-Copy Layer. This layer should be independent of the underlying network technology. A Network Layer is therefore placed between the Zero-Copy Layer and the network driver or API; it contains all the network-dependent functionality of zero-copy communication. The Zero-Copy Layer itself handles all memory and synchronization issues. Because no executable GM release was available, two Network Layers were implemented, one using Named Pipes and one using Windows Sockets. Neither is a true zero-copy technique, but both are stable and fast (especially Named Pipes) and therefore usable for local communication tests.
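The layering described here might look roughly as follows. This is a minimal sketch, not the thesis code: the class names `NetworkLayer`, `SocketNetworkLayer`, and `ZeroCopyLayer`, as well as the 4-byte length header, are assumptions made for illustration. Only the standard `socket` module is used, and, as with the socket-based layer in the thesis, the kernel still copies data internally.

```python
import socket
from abc import ABC, abstractmethod


class NetworkLayer(ABC):
    """Network-dependent part: moves raw bytes (hypothetical interface)."""

    @abstractmethod
    def send(self, data): ...

    @abstractmethod
    def recv_into(self, buf): ...


class SocketNetworkLayer(NetworkLayer):
    """One possible backend on top of a connected stream socket."""

    def __init__(self, sock):
        self.sock = sock

    def send(self, data):
        self.sock.sendall(data)

    def recv_into(self, buf):
        # Fill the caller-supplied buffer completely, without an intermediate buffer.
        got = 0
        while got < len(buf):
            n = self.sock.recv_into(buf[got:])
            if n == 0:
                raise ConnectionError("peer closed the connection")
            got += n


class ZeroCopyLayer:
    """Network-independent part: owns buffers and framing (assumed 4-byte length header)."""

    def __init__(self, net):
        self.net = net

    def send(self, buffer):
        view = memoryview(buffer).cast("B")   # byte view onto the user's buffer, no copy
        self.net.send(len(view).to_bytes(4, "big"))
        self.net.send(view)

    def receive(self):
        header = bytearray(4)
        self.net.recv_into(memoryview(header))
        buf = bytearray(int.from_bytes(header, "big"))
        self.net.recv_into(memoryview(buf))   # data lands directly in the final buffer
        return buf
```

Swapping the backend (for example, a pipe-based or a GM-based class) would only require another `NetworkLayer` subclass; `ZeroCopyLayer` stays unchanged, which is the point of the separation.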

Furthermore, this work showed that different communication protocols such as Message-Passing or Distributed Shared Memory can easily be built on top of such a Zero-Copy Layer. The interfaces between each of these two protocols and the Zero-Copy Layer are nearly identical, so both protocols can be based on the same layer.

Two prototypes were implemented to demonstrate this. The Message-Passing prototype offers parts of MPI: blocking and non-blocking sending and receiving are implemented, as well as a barrier. Studies showed that all the collective communication functions of MPI could be built efficiently on these point-to-point operations.
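As an illustration of how a collective operation can be layered on point-to-point operations, the following sketch implements a broadcast along a binomial tree using only `send` and `recv`. It is not the prototype's code: `Comm`, `bcast`, and `run_ranks` are hypothetical names, ranks are simulated with threads, and in-process queues stand in for the network.

```python
import queue
import threading


class Comm:
    """In-process stand-in for a communicator: one mailbox per (source, dest) pair."""

    def __init__(self, size):
        self.size = size
        self.boxes = {(s, d): queue.Queue() for s in range(size) for d in range(size)}

    def send(self, src, dst, msg):
        """Point-to-point send."""
        self.boxes[(src, dst)].put(msg)

    def recv(self, src, dst):
        """Blocking point-to-point receive."""
        return self.boxes[(src, dst)].get()


def bcast(comm, rank, value, root=0):
    """Binomial-tree broadcast built only from send/recv; returns the root's value."""
    rel = (rank - root) % comm.size   # rank relative to the root
    mask = 1
    # Every non-root rank receives once, from the rank that differs in its lowest set bit.
    while mask < comm.size:
        if rel & mask:
            parent = (rel - mask + root) % comm.size
            value = comm.recv(parent, rank)
            break
        mask <<= 1
    # Then forward to the children in the subtree below this rank.
    mask >>= 1
    while mask > 0:
        if rel + mask < comm.size:
            child = (rel + mask + root) % comm.size
            comm.send(rank, child, value)
        mask >>= 1
    return value


def run_ranks(size, fn):
    """Run fn(comm, rank) on `size` threads and collect the results by rank."""
    comm = Comm(size)
    results = [None] * size

    def worker(r):
        results[r] = fn(comm, r)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With this scheme the number of communication rounds grows logarithmically with the number of ranks, which is one reason collectives built from point-to-point operations can still be efficient.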

The Distributed Shared Memory prototype implements parts of OpenMP. Its memory manager uses the principles of Lazy Release Consistency as in TreadMarks. Multiple-writer functionality is not implemented because it conflicts with zero-copy communication.
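The basic idea of Lazy Release Consistency with a single writer per page can be sketched in a few lines. This is a strongly simplified toy model, not the prototype's memory manager: `Lock`, `Node`, and their fields are hypothetical, a "page" is a single Python value, write notices travel with the lock, and the vector timestamps and diffs used by TreadMarks are left out entirely.

```python
class Lock:
    """Carries write notices from the last releaser to the next acquirer."""

    def __init__(self):
        self.write_notices = {}   # page id -> node holding the latest released copy


class Node:
    """One process with local page copies (toy model, hypothetical names)."""

    def __init__(self, num_pages):
        self.pages = {p: 0 for p in range(num_pages)}
        self.valid = set(range(num_pages))
        self.dirty = set()
        self.source = {}          # page id -> node to fetch from after invalidation

    def acquire(self, lock):
        # Lazy: invalidations are applied only now, not when the writer released.
        for page, owner in lock.write_notices.items():
            if owner is not self:
                self.valid.discard(page)
                self.source[page] = owner

    def read(self, page):
        if page not in self.valid:
            # Access miss: fetch the whole page from its last writer (no diffs).
            self.pages[page] = self.source[page].pages[page]
            self.valid.add(page)
        return self.pages[page]

    def write(self, page, value):
        self.read(page)           # make sure the local copy is current first
        self.pages[page] = value
        self.dirty.add(page)

    def release(self, lock):
        # Only notices are published; page data moves later, on demand.
        for page in self.dirty:
            lock.write_notices[page] = self
        self.dirty.clear()
```

Because whole pages are transferred from a single last writer, a page can in principle be placed directly into the reader's memory, whereas merging diffs from multiple writers would require intermediate buffers. That is one plausible reading of the conflict with zero-copy communication mentioned above.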

Measurements of bandwidth and latency over Named Pipes suggest that the Zero-Copy Layer, once based on GM, could be a powerful implementation of zero-copy communication.

Two concrete applications (Quick-Sort and Gauss elimination) written for the two prototypes show that the MPI prototype is usable in practice. The OpenMP prototype, however, is not yet powerful enough for certain problems, but there are several ideas for how to increase its performance.


ETH Zürich: Department of Computer Science

Comments to Christian Kurmann <kurmann@inf.ethz.ch>
Date 03-10-1999