Message Passing for Gigabit/s Networks with Zero-Copy under Linux

Irina Chihaia (Technical University of Cluj-Napoca)

Diploma Thesis, Summer 1999
Supervisors: Christian Kurmann, Prof. T. Stricker
Institute for Computer Systems, ETH Zürich

Objectives and results


    The MPICH implementation of the MPI standard is built on a lower-level communication layer called the abstract device interface (ADI). The purpose of this interface is to make the MPICH implementation easy to port, by separating the code that handles communication between two processes into an independent module.
    A message-passing ADI must provide four sets of functions: specifying a message to be sent or received, moving data between the API and the message-passing hardware, managing queues (of pending receives and unexpected messages), and providing basic information about the execution environment.

    The most important problem is that data is copied in the portable messaging libraries or in the standard communication protocol stacks. One solution is the implementation of "zero-copy" in communication system software.

    The implementation of a communication system with a zero-copy layer relies on low-level drivers for access to the hardware and on some higher-level software services (e.g. collective communication).
    Traditional UNIX I/O interfaces are based on copy semantics, where read and write calls transfer data between kernel and user-defined buffers.
    Data-touching overheads comprise the operations that require processing of the data within a given buffer, such as checksumming or copying from one buffer to another. There are several efforts to reduce data-touching overheads: integrated layer processing in protocols, carefully designed high-performance network adapters that eliminate copying between devices and the OS kernel, and restructured OS software that minimizes data movement.

     The main purpose is to improve MPICH in order to achieve real Gigabit/s speeds over 1000BaseSX Ethernet.


  1. Establish the performance of the MPICH implementation.
  2. Integrate and enhance the existing zero-copy layer in the Linux OS. Establish the performance of the new MPICH implementation.
  3. Implement an efficient zero-copy layer in the Linux OS based on the fast buffer (fbufs) concept. Establish the performance of the resulting MPICH implementation.

    Problems to be investigated

1. Performance measurements of the MPICH implementation

    The MPICH implementation includes two MPI programs, mpptest and goptest, that provide reliable tests of the performance of an MPI implementation. The program mpptest tests both point-to-point and collective operations on a specified number of processors; the program goptest can be used to study the scalability of collective routines as a function of the number of processors.
    There also exists a script basetest, provided with the MPICH implementation, that can be used to get a more complete picture of the behaviour of a particular system.
    The basic data are short- and long-message performance.
    By using these programs we can get a picture of the best achievable bandwidth performance.

2. Identify and solve all problems regarding integration of the zero-copy layer into the Linux OS

    The second stage of this diploma work consists of integrating the existing zero-copy layer into the Linux operating system.

3. Improve the current MPICH version

    Another step is to investigate and identify the weaknesses of the resulting MPICH implementation. Based on these findings, the current MPICH product is improved. Establish the current performance.
4. Implement an efficient zero-copy layer based on the fast buffer concept

    There exists a very interesting proposal, An Efficient Zero-Copy I/O Framework for UNIX by Sun Microsystems Laboratories, Inc., regarding buffer management and exchange between application programs and the UNIX kernel.

    My proposal is to make use of their solution and adapt it to the Linux operating system. Even though both Solaris and Linux are implementations of the UNIX OS, there are some differences between them.

    A high-bandwidth cross-domain transfer facility, called fast buffers (fbufs), combines virtual page remapping with shared virtual memory, and exploits locality in I/O traffic to achieve high throughput without compromising protection and security.
    The UNIX interface has copy semantics and allows the application to specify an unaligned buffer address anywhere in its address space. Therefore it is necessary to add an interface, based on explicit buffer exchange, for high-bandwidth I/O.
    The main elements of this solution are described below.

    To utilize the full potential of the fbufs scheme, system support for two features is needed: a network adapter that can separate protocol headers from data and deposit incoming payloads page-aligned into fbufs, and checksum computation in the adapter so that the host need not touch the data. An application will be able to use the new API even if the system does not support these features; in that case the performance will be improved only for write and not for read operations.
    We will make use of the G-NIC[tm] II Network Interface Card, which includes a set of important characteristics such as advanced dynamic interrupt coalescing, TCP/IP checksum offload capabilities and large external buffer memories. Therefore, the G-NIC[tm] II is able to satisfy the required system support features.

    The extensions to the API provide for the explicit exchange of buffers (containing data) between application and OS, which eliminates copying (Fig. 1).
      There are several components to be implemented.
    First, a library is required to provide the fbufs interface to Linux applications.
    Then a buffer pool manager will be implemented, responsible for allocating memory, tracking the allocation/deallocation of individual fbufs, managing mappings between user and kernel addresses, and converting between fbufs and STREAMS mblks.
    A new system call implementation will provide the functionality of the new read, write and get interfaces.
    The library invokes the new system call component, via a trap, to transfer fbufs between kernel and application.
    Device driver interface extensions allow the I/O subsystem to allocate fbufs in the kernel.

    The device driver is changed with respect to the allocation and management of fbufs; this adds a small amount of housekeeping to the driver.
    The PC system has separate address spaces for the OS and for I/O. The OS provides device support routines for device drivers to translate between the two domains. In a traditional UNIX system, the addresses of buffers used in I/O operations are fairly arbitrary, but in our implementation the same buffers are frequently reused for I/O to the same device. The device driver is therefore optimized to take advantage of this referential locality by caching translations between kernel and I/O addresses of fbufs, avoiding the expensive translation routines.

    Establish the current performance. The fbufs cost will be compared with the cost of the corresponding operations using standard STREAMS or sockets.

5. Conclusions

    Based on the performance evaluations performed for each product obtained during these three stages, conclusions will be drawn regarding the improvements in performance of 1000BaseSX Ethernet communication.


  1. C. Kurmann, R. Roth, T. Stricker, "Investigating Operating System Support for Gigabit/s Networks with "Zero-Copy" in Windows NT and in Linux"
  2. M. Thadani, Y. Khalidi, "An Efficient Zero-Copy I/O Framework for UNIX"
  3. P. Druschel and L. Peterson, "Fbufs: A High-Bandwidth Cross-Domain Transfer Facility"


    ETH Zürich: Department of Computer Science

    Comments to Christian Kurmann
    Date 27-9-1999