COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems

Chengyu Sun, Huizhang Luo, Hong Jiang, Jeff Zhang, Kenli Li

Research output: Contribution to journal › Article › peer-review


In this paper, we present COFFEE, a cross-layer optimization for fast and efficient execution of the Sinkhorn-Knopp (SK) algorithm on HPC systems with clusters of compute nodes, which exploits architectural features of the system. Analyzing the performance of a typical implementation of the SK algorithm on such a system, we observe a large performance gap between the row rescaling and column rescaling steps of the algorithm: the latter takes much more time than the former. We also find that the costly MPI communication in column rescaling seriously hinders the exploitation of parallelism. By observing and leveraging unique architectural characteristics through several system optimizations, such as column rescaling redesign, data blocking, micro-kernel design, and enhanced intra-node and inter-node MPI communication, COFFEE exploits cross-layer optimization opportunities that enable fast and efficient execution of the SK algorithm. Our experimental results show that COFFEE provides up to 7.5X (2.0X on average) performance improvement over the typical implementation on a single node, and up to 2.9X (1.6X on average) performance improvement over the state-of-the-art MPI Allreduce algorithms on the Tianhe-1 supercomputer.
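For readers unfamiliar with the SK algorithm, the following is a minimal sketch of the baseline iteration the abstract refers to: alternating row rescaling and column rescaling until the matrix is approximately doubly stochastic. It is not COFFEE's implementation. The block row partition, the function names, and the `simulated_allreduce` helper are assumptions made for illustration only; the helper stands in for an MPI Allreduce with a sum operation by adding per-rank partial column sums inside one process.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def row_rescale(a: Matrix) -> None:
    """Divide every row by its own sum; each row is handled independently."""
    for row in a:
        s = sum(row)
        for j in range(len(row)):
            row[j] /= s


def partial_column_sums(block: Matrix, ncols: int) -> List[float]:
    """Column sums over the rows held by one (hypothetical) rank."""
    sums = [0.0] * ncols
    for row in block:
        for j, v in enumerate(row):
            sums[j] += v
    return sums


def simulated_allreduce(partials: List[List[float]]) -> List[float]:
    """Element-wise sum of per-rank vectors (illustrative stand-in for a sum Allreduce)."""
    return [sum(vals) for vals in zip(*partials)]


def column_rescale(a: Matrix, col_sums: List[float]) -> None:
    """Divide every entry by the global sum of its column."""
    for row in a:
        for j in range(len(row)):
            row[j] /= col_sums[j]


def sinkhorn_knopp(a: Matrix, ranks: int = 2, iters: int = 200) -> Matrix:
    """Alternate row and column rescaling on a copy of a positive square matrix."""
    a = [row[:] for row in a]
    n = len(a)
    # Assumed layout: contiguous blocks of rows, one block per simulated rank.
    bounds: List[Tuple[int, int]] = [
        (r * n // ranks, (r + 1) * n // ranks) for r in range(ranks)
    ]
    for _ in range(iters):
        row_rescale(a)  # needs only data local to each row
        # Column sums span all row blocks, so partial sums must be combined.
        partials = [partial_column_sums(a[lo:hi], n) for lo, hi in bounds]
        column_rescale(a, simulated_allreduce(partials))
    return a
```

Calling `sinkhorn_knopp` on a small positive matrix returns a copy whose row sums and column sums are both close to 1. Under the assumed row partition, row rescaling touches only rank-local data, whereas column rescaling needs a global reduction in every iteration, which is the kind of communication step the abstract identifies as costly.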

Original language: English (US)
Pages (from-to): 2167-2179
Number of pages: 13
Journal: IEEE Transactions on Parallel and Distributed Systems
Issue number: 7
State: Published - Jul 1 2023
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics


Keywords

  • Data blocking
  • HPC system
  • MPI Allreduce
  • Micro-kernel design
  • Sinkhorn-Knopp algorithm

