The T3E represents a further step from a distributed memory machine toward a coherent shared memory multiprocessor. Its global address space can map the entire memory of the machine, but a compiler is still forced to copy data from remote memory into local memory, or at least into the 512 E-registers, before computing on it. The characterizations reported in this section are based on the shmem_iget and shmem_iput routines provided by Cray, which we treated as black-box building blocks.
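To make the strided access pattern concrete, the following is a minimal local sketch in C of what a strided get does to its data. It is not Cray's implementation: strided_get_model is a hypothetical name, it copies within one address space rather than from a remote processor, and the argument order (target, source, target stride, source stride, element count) is an assumption modeled on the strided get routine named above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical local model of a strided get (NOT the Cray routine).
 * Copies nelems elements, reading every sst-th element of source and
 * writing every tst-th element of target. Strides are in elements.
 * A real remote get would also take the id of the remote processor.
 */
static void strided_get_model(long *target, const long *source,
                              ptrdiff_t tst, ptrdiff_t sst, size_t nelems)
{
    assert(nelems == 0 || (target != NULL && source != NULL));
    for (size_t i = 0; i < nelems; i++) {
        target[(ptrdiff_t)i * tst] = source[(ptrdiff_t)i * sst];
    }
}
```

With a target stride of 1 and a source stride of 2, for instance, the call gathers every second element of the source into a contiguous target buffer, which is the kind of strided access whose cost is characterized here.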
The performance of the T3E is in a class of its own. It sustains 350 MByte/s for contiguous blocks and drops to 140 MByte/s or 70 MByte/s for strided accesses, depending on how the transfer is programmed.
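As a rough illustration of what these rates mean in practice, the sketch below converts a bandwidth figure into a transfer time. The helper transfer_ms is a hypothetical name, and the model assumes the quoted rate is sustained for the whole transfer and ignores startup latency.

```c
#include <assert.h>

/*
 * Hypothetical helper: time in milliseconds to move `mbytes` MByte at a
 * sustained rate of `mbyte_per_s` MByte/s. Startup latency is ignored,
 * so this is only a first-order estimate.
 */
static double transfer_ms(double mbytes, double mbyte_per_s)
{
    assert(mbyte_per_s > 0.0);
    return 1000.0 * mbytes / mbyte_per_s;
}
```

Under this simple model, moving 7 MByte takes about 20 ms at 350 MByte/s, 50 ms at 140 MByte/s, and 100 ms at 70 MByte/s, so the slowest strided variant is five times slower than the contiguous case.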