To achieve good scalability, most large-scale parallel machines are based on physically distributed-memory architectures. While such architectures make the machines easy to scale, performance scalability is another matter: because memory is distributed across the network, a remote memory access incurs a large latency that degrades performance. A standard approach to improving performance is to tailor a problem's data and workload distribution to the specific architecture. We believe this approach is undesirable because it runs counter to the idea of general-purpose parallel computing. This report presents our experience with a multithreaded execution model whose objective is latency tolerance, aiming to make performance less sensitive to data partitioning strategies. Specifically, we investigate the effects on performance of (1) threading on workload distribution, (2) threading on data distribution, and (3) thread granularity. We use the 80-processor EM-4 multithreaded multiprocessor as the target machine. Matrix multiplication is selected as the model problem, with three implementations: an element-wise method, a row-column-major blocked method, and an m × m submatrix blocked method. Execution results demonstrate that multithreading with simple data and workload distribution strategies yields at least 60% of the performance of the highly tailored, best-performing implementation. This is a promising result given the type of problem selected for comparison and the number of improvements that can still be made to the EM-4 machine.
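For readers unfamiliar with the third implementation style named above, the m × m submatrix blocked method can be sketched as follows. This is an illustrative sequential sketch only; the matrix size, block size, and all identifiers are assumptions for illustration, whereas the paper's implementations run multithreaded on the EM-4.

```c
#include <assert.h>
#include <string.h>

#define N 8   /* matrix dimension (hypothetical, chosen for illustration) */
#define M 4   /* m x m submatrix block size; assumes M divides N */

/* C = A * B using m x m submatrix blocking: each (bi, bj, bk) step
   multiplies one M x M block of A by one M x M block of B and
   accumulates the product into the corresponding M x M block of C. */
static void blocked_matmul(const double A[N][N], const double B[N][N],
                           double C[N][N])
{
    memset(C, 0, sizeof(double) * N * N);
    for (int bi = 0; bi < N; bi += M)
        for (int bj = 0; bj < N; bj += M)
            for (int bk = 0; bk < N; bk += M)
                /* multiply block (bi, bk) of A by block (bk, bj) of B */
                for (int i = bi; i < bi + M; i++)
                    for (int j = bj; j < bj + M; j++) {
                        double s = C[i][j];
                        for (int k = bk; k < bk + M; k++)
                            s += A[i][k] * B[k][j];
                        C[i][j] = s;
                    }
}
```

Each block-level product touches only 3·M² elements at a time, which is what makes the blocked decomposition a natural unit for distributing both data and work across processors.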
Original language: English (US)
Number of pages: 10
State: Published - Jan 1 1995