TY - JOUR
T1 - Remote memory access
T2 - A case for portable, efficient and library independent parallel programming
AU - Gerbessiotis, Alexandros V.
AU - Lee, Seung Yeop
PY - 2004
Y1 - 2004
AB - In this work we make a strong case for remote memory access (RMA) as the effective way to program a parallel computer, by proposing a framework that supports RMA in a library-independent, simple, and intuitive way. If one uses our approach, the parallel code one writes will run transparently under MPI-2-enabled libraries as well as under bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm, and we examine their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code and, in the case of radix-sort, substantially more efficient. In addition, our work can be used as a comparative study of the relevant capabilities of the three libraries.
UR - http://www.scopus.com/inward/record.url?scp=22044452114&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=22044452114&partnerID=8YFLogxK
U2 - 10.1155/2004/934718
DO - 10.1155/2004/934718
M3 - Article
AN - SCOPUS:22044452114
SN - 1058-9244
VL - 12
SP - 169
EP - 183
JO - Scientific Programming
JF - Scientific Programming
IS - 3
ER -