TY - GEN
T1 - Canopus
T2 - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
AU - Lu, Tao
AU - Suchyta, Eric
AU - Pugmire, Dave
AU - Choi, Jong
AU - Klasky, Scott
AU - Liu, Qing
AU - Podhorszki, Norbert
AU - Ainsworth, Mark
AU - Wolf, Matthew
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/9/22
Y1 - 2017/9/22
N2 - Scientific simulations on high performance computing (HPC) platforms generate large quantities of data. To bridge the widening gap between compute and I/O, and to enable data to be more efficiently stored and analyzed, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. However, a systematic solution to support these steps has been lacking in the current HPC software ecosystem. To that end, this paper develops Canopus, a progressive JPEG-like data management scheme for storing and analyzing big scientific data. It co-designs data decimation, compression, and storage, taking the hardware characteristics of each storage tier into consideration. With reasonably low overhead, our approach refactors simulation data into a much smaller, reduced-accuracy base dataset and a series of deltas that are used to augment the accuracy if needed. The base dataset and deltas are compressed and written to multiple storage tiers. Data saved on different tiers can then be selectively retrieved to restore the level of accuracy that satisfies data analytics. Thus, Canopus provides a paradigm shift towards elastic data analytics and enables end users to make trade-offs between analysis speed and accuracy on the fly. We evaluate the impact of Canopus on unstructured triangular meshes, a pervasive data model used by scientific modeling and simulations. In particular, we demonstrate the progressive data exploration of Canopus using the 'blob detection' use case on fusion simulation data.
AB - Scientific simulations on high performance computing (HPC) platforms generate large quantities of data. To bridge the widening gap between compute and I/O, and to enable data to be more efficiently stored and analyzed, simulation outputs need to be refactored, reduced, and appropriately mapped to storage tiers. However, a systematic solution to support these steps has been lacking in the current HPC software ecosystem. To that end, this paper develops Canopus, a progressive JPEG-like data management scheme for storing and analyzing big scientific data. It co-designs data decimation, compression, and storage, taking the hardware characteristics of each storage tier into consideration. With reasonably low overhead, our approach refactors simulation data into a much smaller, reduced-accuracy base dataset and a series of deltas that are used to augment the accuracy if needed. The base dataset and deltas are compressed and written to multiple storage tiers. Data saved on different tiers can then be selectively retrieved to restore the level of accuracy that satisfies data analytics. Thus, Canopus provides a paradigm shift towards elastic data analytics and enables end users to make trade-offs between analysis speed and accuracy on the fly. We evaluate the impact of Canopus on unstructured triangular meshes, a pervasive data model used by scientific modeling and simulations. In particular, we demonstrate the progressive data exploration of Canopus using the 'blob detection' use case on fusion simulation data.
KW - Data analytics
KW - HPC
KW - Storage
UR - http://www.scopus.com/inward/record.url?scp=85032626321&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032626321&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2017.62
DO - 10.1109/CLUSTER.2017.62
M3 - Conference contribution
AN - SCOPUS:85032626321
T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC
SP - 58
EP - 69
BT - Proceedings - 2017 IEEE International Conference on Cluster Computing, CLUSTER 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 5 September 2017 through 8 September 2017
ER -
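
The abstract above describes Canopus's core idea: refactor full-accuracy data into a small reduced-accuracy base plus a series of deltas, then restore accuracy progressively by retrieving more deltas. As a rough illustration only (not the authors' implementation, which decimates unstructured triangular meshes and compresses each level for a specific storage tier), here is a minimal Python sketch of the base-plus-deltas idea, with precision truncation on a flat array standing in for mesh decimation; all names and parameters are hypothetical:

import numpy as np

def refactor(data, decimals=(1, 3)):
    """Split data into a low-accuracy base plus deltas that restore detail.

    Toy stand-in for Canopus-style progressive refactoring: the base keeps
    `decimals[0]` digits; each delta adds back the detail lost at the
    previous level, and the final delta restores full accuracy.
    """
    base = np.round(data, decimals[0])
    deltas, approx = [], base
    for d in decimals[1:]:
        finer = np.round(data, d)
        deltas.append(finer - approx)   # detail gained at this level
        approx = finer
    deltas.append(data - approx)        # last delta closes the gap entirely
    return base, deltas

def restore(base, deltas, level):
    """Rebuild the dataset using the first `level` deltas: more deltas, more accuracy."""
    out = base.copy()
    for delta in deltas[:level]:
        out = out + delta
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=1000)
    base, deltas = refactor(data)
    for k in range(len(deltas) + 1):
        err = np.max(np.abs(restore(base, deltas, k) - data))
        print(f"level {k}: max error {err:.2e}")

In the paper's setting, the base and each delta would additionally be compressed and placed on different storage tiers, so an analysis can stop after whichever level meets its accuracy requirement.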