COFFEE: Cross-Layer Optimization for Fast and Efficient Executions of Sinkhorn-Knopp Algorithm on HPC Systems

Chengyu Sun, Huizhang Luo, Hong Jiang, Jeff Zhang, Kenli Li

Research output: Contribution to journal › Article › peer-review


In this paper, we present COFFEE, a cross-layer optimization for fast and efficient execution of the Sinkhorn-Knopp (SK) algorithm on HPC systems with clusters of compute nodes, which exploits architectural features of the system. Analyzing the performance of a typical implementation of the SK algorithm on such a system, we observe a large performance gap between the row rescaling and column rescaling steps of the algorithm: the latter takes much more time than the former. We also find that the costly MPI communication in column rescaling seriously hinders the exploitation of parallelism. By observing and leveraging unique architectural characteristics through different system optimizations, such as column rescaling redesign, data blocking, micro-kernel design, and enhanced intra-node and inter-node MPI communication, COFFEE exploits cross-layer optimization opportunities that enable fast and efficient execution of the SK algorithm. Our experimental results show that COFFEE provides up to 7.5X (2.0X on average) performance improvement over the typical implementation on a single node, and up to 2.9X (1.6X on average) performance improvement over state-of-the-art MPI Allreduce algorithms on the Tianhe-1 supercomputer.
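
As an illustration of the two rescaling steps contrasted above, the following is a minimal sketch of a plain SK iteration in Python. It is not COFFEE's implementation. The function names, the `num_blocks` parameter, and the row-block partitioning are assumptions made for illustration, and the element-wise sum of per-block partial column sums only stands in for what an MPI Allreduce would do across ranks.

```python
def row_rescale(a):
    """Scale each row to sum to 1; rows are contiguous in row-major storage."""
    for row in a:
        s = sum(row)
        for j in range(len(row)):
            row[j] /= s


def column_sums_blocked(a, num_blocks):
    """Column sums computed from row blocks (hypothetical partitioning).

    Each block stands in for the rows held by one process. Adding the
    partial vectors element-wise plays the role of an Allreduce(sum).
    """
    n_rows, n_cols = len(a), len(a[0])
    block = (n_rows + num_blocks - 1) // num_blocks  # ceiling division
    partials = []
    for start in range(0, n_rows, block):
        local = [0.0] * n_cols
        for row in a[start:start + block]:  # stream rows, no strided walk
            for j, v in enumerate(row):
                local[j] += v
        partials.append(local)
    return [sum(p[j] for p in partials) for j in range(n_cols)]


def column_rescale(a, num_blocks=2):
    """Scale each column to sum to 1 using the reduced column sums."""
    sums = column_sums_blocked(a, num_blocks)
    for row in a:
        for j in range(len(row)):
            row[j] /= sums[j]


def sinkhorn_knopp(a, iterations=50, num_blocks=2):
    """Alternate row and column rescaling on a copy of a positive matrix."""
    m = [[float(v) for v in row] for row in a]
    for _ in range(iterations):
        row_rescale(m)
        column_rescale(m, num_blocks)
    return m


balanced = sinkhorn_knopp([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
```

For a square matrix with strictly positive entries, `balanced` approaches a doubly stochastic matrix, with every row and column summing to 1. The sketch shows why column rescaling is the harder step: row sums need only local, contiguous data, while column sums combine contributions from every row block, which in a distributed setting requires communication.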

Original language: English (US)
Pages (from-to): 1-13
Number of pages: 13
Journal: IEEE Transactions on Parallel and Distributed Systems
State: Accepted/In press - 2023
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics


Keywords

  • Clustering algorithms
  • Cross layer design
  • Data blocking
  • HPC system
  • Machine learning algorithms
  • Micro-kernel design
  • MPI Allreduce
  • Optimization
  • Program processors
  • Sinkhorn-Knopp algorithm
  • Standards
  • Supercomputers

