Performance in an SMP setting
Copy bandwidth decreases for simultaneous access with 1, 2, 4 and 8 processors
Topics of interest:
- small working sets in caches: performance remains same
- large working sets in memory: interesting differences
- behavior for even/odd strides
“Gather copy stream” (strided load / contiguous store)
To bring out the differences between high-end and low-end systems, we separately compare the copy bandwidth in a multiprocessing scenario in which 1, 2, 4 or 8 processors copy data in memory simultaneously.
For small working sets that fit in the caches, the performance remains the same, as the measurements show.
More interesting are the differences for large working sets in main memory.
We do not only use a simple copy, as e.g. McCalpin's ‘STREAM Benchmark’ does, but also measure a gather copy stream in which the processors read strided data and store it contiguously.
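
The gather copy stream can be illustrated with a minimal C sketch. This is not the benchmark code behind the measurements; the names gather_copy and gather_copy_bytes, the element type double and the byte-counting convention are assumptions made for illustration. With a stride of 1 the kernel reduces to a simple contiguous copy.

```c
#include <assert.h>
#include <stddef.h>

/* Gather copy (hypothetical kernel name): load every `stride`-th element
 * of src and store the loaded values contiguously into dst.
 * src must hold at least (n - 1) * stride + 1 elements, dst at least n. */
void gather_copy(double *dst, const double *src, size_t n, size_t stride)
{
    assert(stride >= 1);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];   /* strided load, contiguous store */
}

/* Bytes counted for one call: one load plus one store per element.
 * Assumed convention: only the elements actually used are counted,
 * not the full cache lines the hardware may transfer for strided loads. */
size_t gather_copy_bytes(size_t n)
{
    return 2 * n * sizeof(double);
}
```

A bandwidth figure would then be gather_copy_bytes(n) divided by the measured time of one call. In a multiprocessor run one would presumably start the same kernel on 1, 2, 4 or 8 processors at once, each on its own arrays, and compare the results for working sets that fit in the caches against those that only fit in main memory.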