Improving the Network Interfaces for Gigabit Ethernet in Clusters of PCs by Protocol Speculation

Christian Kurmann, Michel Muller, Felix Rauch and Thomas M. Stricker
Laboratory for Computer Systems
Swiss Federal Institute of Technology (ETH)
CH-8092 Zurich, Switzerland
{kurmann,rauch,tomstr}@inf.ethz.ch
http://www.inf.ethz.ch/

CS Technical Report #339

Modern massively parallel computers are built from the same commodity processors and memories that are used in workstations and PCs. A similar trend towards commodity components is visible for the interconnects that link multiple nodes in clusters of PCs. Only a few years ago the market was dominated by highly specialized supercomputer interconnects (e.g. in a Cray T3D). Today's networking solutions are still proprietary, but they do connect to standard buses (e.g. as a PCI card). In the future the networking solutions of the Internet (e.g. Gigabit Ethernet) will offer Gigabit speeds at lower cost.

Commodity platforms offer good compute performance, but they cannot yet fully utilize the potential of Gigabit/s communication technology, at least as long as commodity networks like Ethernet with standard protocols like TCP/IP are used. While the speed of Ethernet has grown from 10 to 1000 Mbit/s, the functionality and the architectural support in the network interfaces have not kept up, and the driver software becomes a limiting factor. Network speeds are rapidly catching up with the streaming speed of main memory. Therefore a true zero-copy network interface architecture is required to sustain the raw network speed in applications. Many common Gigabit Ethernet network protocol stacks are called zero-copy at the OS level, but upon a closer look they are not really zero-copy down to the hardware level, since there remains a last copy in the driver for the fragmentation/defragmentation of the transferred network packets, which are smaller than a page.
Defragmenting all the packets of various communication protocols correctly in hardware remains an extremely complex task, resulting in a large amount of additional circuitry to be incorporated into existing commodity hardware. Therefore we take the different route of studying and implementing a speculative defragmentation technique that can eliminate the last defragmenting copy operation from zero-copy TCP/IP stacks on existing hardware. The speculative technique shows even greater potential for improved efficiency once the present network interfaces are enhanced by a few protocol matching registers with a simple control path to the DMA engines. The payload of fragmented packets is separated from the headers and stored into a memory page that can be mapped directly to its final destination in user memory. The checks for correctness and compliance with the protocol are deferred until, after several packets, an interrupt for protocol processing is taken. The success of a speculative approach suggests that a modest hardware addition to a current Gigabit Ethernet adapter design (e.g. the Hamachi chip) is sufficient to provide a high speed data path for zero-copy bulk transfers.

For an evaluation of our ideas we integrated a network interface driver with speculative defragmentation into existing zero-copy protocol stacks with page remapping, fbufs or user/kernel shared memory. Performance measurements indicate that we can improve performance over the standard Linux 2.2 TCP/IP stack by a factor of 1.5 to 2 for uninterrupted burst transfers. Based on those implementations we can present real measurement data on how simple protocol matching hardware could improve the performance of bulk transfers with a commodity Gigabit Ethernet interface. As with any hardware solution using speculative techniques, a fairly accurate prediction of the good case (i.e. that a sequence of incoming packets is consecutive) is required.
We show success rates of uninterrupted bulk transfers for a database and a scientific computation code on a cluster of PCs. The hit rate can be greatly improved with a simple matching mechanism in the network interface that separates packets suitable for zero-copy processing from other packets, which are handled by the regular protocol stack.
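
The speculative receive path described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the driver discussed in this report: the 4096-byte page, the 1480-byte fragment payload (a 1500-byte Ethernet MTU minus a 20-byte IP header), and all names (Packet, fragment, speculative_receive, speculation_held, defragment_by_copy, receive_burst, match_id) are hypothetical. Payloads are placed back to back into a page without any checks, the headers are set aside, and the consistency check runs only after the whole burst. If the check fails, the code falls back to a conventional copying defragmentation. The optional match_id argument models the proposed matching mechanism by filtering out foreign packets before speculation.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096      # assumed page size in bytes
FRAG_PAYLOAD = 1480   # assumed payload per fragment (1500-byte MTU - 20-byte IP header)


@dataclass
class Packet:
    datagram_id: int  # hypothetical field: datagram the fragment belongs to
    offset: int       # byte offset of this payload within the datagram
    payload: bytes


def fragment(datagram_id, data):
    """Split one datagram into fragments, as a sender would."""
    return [Packet(datagram_id, off, data[off:off + FRAG_PAYLOAD])
            for off in range(0, len(data), FRAG_PAYLOAD)]


def speculative_receive(packets):
    """Model of the DMA step: store payloads back to back, headers aside, no checks."""
    page = bytearray(PAGE_SIZE)
    headers = []
    pos = 0
    for pkt in packets:
        end = pos + len(pkt.payload)
        if end <= PAGE_SIZE:          # never write past the page
            page[pos:end] = pkt.payload
        headers.append((pkt.datagram_id, pkt.offset, len(pkt.payload)))
        pos = end
    return page, headers


def speculation_held(headers):
    """Deferred check: one datagram, each payload starting where the previous ended."""
    if not headers:
        return False
    first_id = headers[0][0]
    expected = 0
    for dgram_id, offset, length in headers:
        if dgram_id != first_id or offset != expected:
            return False
        expected += length
    return expected <= PAGE_SIZE


def defragment_by_copy(packets, datagram_id):
    """Fallback: conventional defragmentation with an explicit copy per fragment."""
    page = bytearray(PAGE_SIZE)
    for pkt in packets:
        if pkt.datagram_id == datagram_id:
            page[pkt.offset:pkt.offset + len(pkt.payload)] = pkt.payload
    return page


def receive_burst(packets, match_id=None):
    """Try the speculative path first; fall back to copying if the check fails."""
    if match_id is not None:
        # Models the matching mechanism: only matching packets are speculated on.
        packets = [p for p in packets if p.datagram_id == match_id]
    if not packets:
        return bytearray(PAGE_SIZE), "empty"
    page, headers = speculative_receive(packets)
    if speculation_held(headers):
        return page, "zero-copy"      # page could be remapped to user memory as is
    return defragment_by_copy(packets, packets[0].datagram_id), "copied"


if __name__ == "__main__":
    data = bytes(i % 251 for i in range(PAGE_SIZE))
    frags = fragment(7, data)
    foreign = Packet(9, 0, b"x" * 100)
    print(receive_burst(frags)[1])
    print(receive_burst([frags[0], foreign, frags[1], frags[2]])[1])
    print(receive_burst([frags[0], foreign, frags[1], frags[2]], match_id=7)[1])
```

In this toy model an uninterrupted burst takes the zero-copy path, a burst interrupted by a foreign packet falls back to copying, and the same interrupted burst takes the zero-copy path again once the foreign packet is filtered out by the match. That mirrors, in simplified form, why a matching mechanism would be expected to raise the hit rate.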