Abstract
In this paper, we present COFFEE, cross-layer optimization for fast and efficient executions of the Sinkhorn-Knopp (SK) algorithm on HPC systems with clusters of compute nodes by exploring some architectural features of the system. By analyzing the performance of a typical implementation of the SK algorithm on such a system, a huge performance gap is observed between the row rescaling and column rescaling of the algorithm, where the latter requires much more time than the former. We also found that the costly MPI communication of the column rescaling seriously hinders the exploitation of parallelism. By observing and leveraging unique architectural characteristics across different system optimizations, such as column rescaling redesign, data blocking, micro-kernel design, enhanced intra-node and inter-node communication in MPI, etc., COFFEE is able to explore cross-layer optimization opportunities that enable fast and efficient execution of the SK algorithm. Our experimental results show that COFFEE provides up to 7.5X with an average of 2.0X performance improvement over the typical implementation on a single node, and up to 2.9X with an average of 1.6X performance improvement over the state-of-the-art MPI Allreduce algorithms on Tianhe-1 supercomputer.
Original language | English (US) |
---|---|
Pages (from-to) | 2167-2179 |
Number of pages | 13 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 34 |
Issue number | 7 |
DOIs | |
State | Published - Jul 1 2023 |
Externally published | Yes |
All Science Journal Classification (ASJC) codes
- Signal Processing
- Hardware and Architecture
- Computational Theory and Mathematics
Keywords
- Data blocking
- HPC system
- MPI allreduce
- micro-kernel design
- sinkhorn-knopp algorithm