From AAPC Algorithms to High Performance Permutation Routing and Sorting Thomas M. Stricker and Jonathan C. Hardwick School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Several recent papers have proposed or analyzed optimal algorithms to route all-to-all personalized communication (AAPC) over communication networks such as meshes, hypercubes and omegaswitches. However, the constant factors of these algorithms are often an obscure function of system parameters such as link speed,processor clock rate, and memory access time. In this paper we investigate these architectural factors, showing the impact of the communication style, the network routing table, and most importantly, the local memory system, on AAPC performance and permutationrouting on the Cray T3D. The fast hardware barriers on the T3D permit a straightforward AAPC implementation using routing phases separated by barriers, which improve performance by controlling congestion. However, we found that a practical implementation was difficult, and the resulting AAPC performance was less than expected. After detailedanalysis, several corrections were made to the AAPC algorithm and to the machine's routing table, raising the performance from 41%to 74% of the nominal bisection bandwidth of the network. Most AAPC performance measurements are for permuting large, contiguous blocks of data (i.e., every processor has an array of Pcontiguous elements to be sent to every other processor). In practice, sorting and true h-h permutation routing require data elements to be gathered from their source location into a buffer, transferred over the network, and scattered into their final location in a destinationarray. We obtain an optimal T3D implementation by chaining local and remote memory operations together. We quantify the imple-mentation's efficiency both experimentally and theoretically, using the recently-introduced copy transfer model, and present results fora counting sort based on this AAPC implementation.