TY - GEN
T1 - Stability-preserving Lossy Compression for Large-scale Partial Differential Equations
AU - Gong, Qian
AU - Ainsworth, Mark
AU - Chen, Jieyang
AU - Liang, Xin
AU - Zhu, Liangji
AU - Klasky, Ethan
AU - Athawale, Tushar
AU - Liu, Qing
AU - Rangarajan, Anand
AU - Ranka, Sanjay
AU - Klasky, Scott
N1 - Publisher Copyright: © 2025 Copyright held by the owner/author(s).
PY - 2025/11/15
Y1 - 2025/11/15
AB - Checkpoint/Restart (C/R) strategies are vital for fault tolerance in PDE-based scientific simulations, yet traditional checkpointing incurs significant I/O overhead. Lossy compression offers a scalable solution by reducing checkpoint data size, but conventional methods often lack control over physical invariants (e.g., energy), leading to instability such as oscillations or divergence in Partial Differential Equation (PDE) systems. This paper introduces a stability-preserving compression approach tailored for PDE simulations by explicitly controlling kinetic and potential energy perturbations to ensure stable restarts. Extensive experiments conducted across diverse PDE configurations demonstrate that our method maintains numerical stability with minimal error magnification, even across multiple checkpoint-restart cycles, outperforming state-of-the-art lossy compressors. Parallel evaluations on the Frontier supercomputer show up to 8.4× improvement in checkpoint write performance and 6.3× in read performance, while maintaining relative L2 errors ∼2e-6 throughout continued simulation. These results provide practical guidance for balancing compression accuracy, stability, and computational efficiency in large-scale PDE applications.
KW - Checkpoint-restart
KW - large-scale PDEs
KW - lossy compression
KW - stability preservation
UR - https://www.scopus.com/pages/publications/105023974198
U2 - 10.1145/3712285.3759878
DO - 10.1145/3712285.3759878
M3 - Conference contribution
AN - SCOPUS:105023974198
T3 - Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
SP - 1992
EP - 2005
BT - Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
PB - Association for Computing Machinery, Inc
T2 - 2025 International Conference for High Performance Computing, Networking, Storage, and Analysis, SC 2025
Y2 - 16 November 2025 through 21 November 2025
ER -