Characterizing memory system performance for local and remote accesses in high end SMPs, low end SMPs and clusters of SMPs

Ch. Kurmann and T. Stricker
Laboratory for Computer Systems
ETHZ - Swiss Institute of Technology
CH-8092 Zuerich, SWITZERLAND

In today's marketplace there are low end SMPs (multiprocessor PCs), high end SMPs (big workstations and servers) and specialized MPP nodes, all based on microprocessors with full or partial hardware support for coherent shared memory. High end designs include a carefully engineered memory system that supports the additional data streams caused by data intensive computations and by inter-node communication. Low end designs feature extremely low cost, but much weaker memory systems.

Characterizing SMP Memory Systems Performance

To find out more about memory systems, we refined an existing memory system microbenchmark that is independent of naming and coherence issues and is therefore capable of characterizing memory system performance for both local and remote memory, regardless of the underlying architecture [ISCA95-StrGr]. Unlike the somewhat simplistic McCalpin loops, our test captures many more aspects of the memory hierarchy (behavior under spatial and temporal locality) [HPCA97-StrGr]. In our talk we give a detailed picture of the memory systems in low end SMPs, such as a Dual PentiumPro Class PC, and compare them to high end systems, such as a DEC 8400, an SGI Onyx 10k and a Sun Enterprise Server.

The important issue is whether the memory system of low cost SMPs (based on an internal bus) can fully sustain multiprocessing and communication in clusters of SMPs. While high end systems can afford memory systems with special hooks for inter-node communication at Gigabit/s speeds, low end systems must rely entirely on standard I/O interfaces (a PCI bus) for economic reasons.
For those remote memory accesses, we obtained measurements from a small testbed of a few PCs connected by several Gigabit networking technologies, and we compare them to data from an SGI Cray T3E and an SGI Origin/Infinite Reality.

Conclusions

From our measurements we can conclude that only the memory systems of high end SMPs scale with the number of processors. In low end SMPs, the benefit of symmetric multiprocessing ends abruptly as soon as the working set exceeds the L2 or L3 caches that are located close to the microprocessors. For communication operations we notice excellent memory system performance on the high end MPP nodes. In the low end systems interconnected by Gigabit networks, performance can peak near the bandwidth limit of the PCI bus, but only for the simple transfer modes of contiguous blocks of data. For strided data or for remote loads of single words, the performance of low end SMPs collapses.

ETH Project CoPs - Building a Cluster of PCs based on multiprocessor Intel Pentium Pro nodes

Our computer architecture group is currently engaged in designing and building a cluster of PCs from off-the-shelf SMPs linked by a commercial Gigabit interconnect (Dolphin/Scali SCI, Myricom Myrinet or Gigabit Ethernet). The studies on memory system performance for shared memory are part of that building effort.

A fully typeset two page summary including several graphs with performance figures can be viewed under http://www.cs.inf.ethz.ch/CoPs/isca98ws.pdf.
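
As an illustration of the kind of measurement described under "Characterizing SMP Memory Systems Performance", the following is a minimal Python sketch of a load test that sweeps over working set size (temporal locality) and access stride (spatial locality). It is not the authors' benchmark: the function names, the 8-byte word size, the parameter ranges and the choice to touch every word once per pass are all assumptions made for illustration. Interpreter overhead dominates the timings, so only the structure of the sweep is meaningful here; a real measurement would use a compiled inner loop.

```python
import time
from array import array

WORD_BYTES = 8  # assumption: 64-bit words


def strided_indices(working_set_words, stride):
    """Yield every index of the working set exactly once, in strided order.

    Pass k visits k, k + stride, k + 2*stride, ... so consecutive
    accesses are `stride` words apart while all words are still touched.
    """
    for start in range(min(stride, working_set_words)):
        yield from range(start, working_set_words, stride)


def load_sum(buf, working_set_words, stride):
    """Read the working set with the given stride and return the sum."""
    total = 0
    for i in strided_indices(working_set_words, stride):
        total += buf[i]
    return total


def measure(buf, working_set_words, stride, repeats=3):
    """Return the best observed load rate in MByte/s for one data point."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        load_sum(buf, working_set_words, stride)
        best = min(best, time.perf_counter() - t0)
    return working_set_words * WORD_BYTES / best / 1e6


def sweep(max_words=1 << 16, strides=(1, 2, 4, 8, 16)):
    """Measure every (working set, stride) pair; hypothetical ranges."""
    buf = array("q", range(max_words))  # one buffer of 8-byte integers
    results = {}
    ws = 1 << 10
    while ws <= max_words:
        for s in strides:
            results[(ws, s)] = measure(buf, ws, s)
        ws *= 4
    return results


if __name__ == "__main__":
    for (ws, s), mbps in sorted(sweep().items()):
        print(f"working set {ws * WORD_BYTES:>8} bytes, stride {s:>2}: "
              f"{mbps:8.2f} MByte/s")
```

Plotting the resulting rate over the two axes (working set and stride) gives the kind of surface on which cache capacities and the penalty for non-contiguous accesses become visible. The same loop structure could in principle be pointed at remote memory, provided the platform maps it into the address space.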