Local Load Access: SGI Cray T3E
Access pattern (stride between 64bit words)
This picture depicts the memory bandwidth for the T3E. The simple node architecture is reflected in the graph: there are three performance levels, exposing a sharp drop from cache access to DRAM memory access when we exceed the working set of the L1/L2 caches. Not surprisingly, the local memory access performance of the T3E resembles the picture of the DEC 8400 in the performance of its L1 and L2 caches. Any differences for small working sets are due to idiosynchrocies in the measurement environment, i.e. the lack of inlined timer functions, and the (still) larger loop overhead inflicted on programs by the Cray compiler.
The important differences to the DEC 8400 are (i) the lack of L3 caches (ii) and the better support for contiguous access streams from main memory. While the DEC 8400 achieves just 150 MByte/s for contiguous loads out of DRAM main memory, the T3E node is capable of load transfers of up to 430 MByte/s. In other words, the memory system bandwidth of DRAM on the T3E compares well with the memory system performance out of the local L3 cache on the DEC 8400 (which is about 600 MByte/s). Again this performance is due to the support hardware for streaming on the T3E.