TY - GEN
T1 - Six degrees of scientific data: Reading patterns for extreme scale science IO
T2 - 20th ACM International Symposium on High-Performance Parallel and Distributed Computing, HPDC'11
AU - Lofstead, Jay
AU - Polte, Milo
AU - Gibson, Garth
AU - Klasky, Scott
AU - Schwan, Karsten
AU - Oldfield, Ron
AU - Wolf, Matthew
AU - Liu, Qing
PY - 2011
Y1 - 2011
N2 - Petascale science simulations generate tens of TBs of application data per day, much of it devoted to their checkpoint/restart fault tolerance mechanisms. Previous work demonstrated the importance of carefully managing such output to prevent application slowdown due to IO blocking and resource contention that degrade simulation performance, and to fully exploit the IO bandwidth available to the petascale machine. This paper takes a further step in understanding and managing extreme-scale IO. Specifically, its evaluations seek to understand how to efficiently read data for subsequent data analysis, visualization, checkpoint restart after a failure, and other read-intensive operations. Taken together, these actions support the 'end-to-end' needs of scientists, enabling the scientific processes being undertaken. Contributions include the following. First, working with application scientists, we define 'read' benchmarks that capture the common read patterns used by analysis codes. Second, these read patterns are used to evaluate different IO techniques at scale, to understand how alternative data sizes and organizations affect the performance seen by end users. Third, defining the novel notion of a 'data district' to characterize how data is organized for reads, we experimentally compare the read performance of the ADIOS middleware's log-based BP format to that of the logically contiguous NetCDF and HDF5 formats commonly used by analysis tools. Measurements assess the performance seen across patterns and with different data sizes, organizations, and read process counts. Outcomes demonstrate that high end-to-end IO performance requires data organizations that offer flexibility in data layout and placement on parallel storage targets, including in ways that trade off the performance of data writes against that of reads.
AB - Petascale science simulations generate tens of TBs of application data per day, much of it devoted to their checkpoint/restart fault tolerance mechanisms. Previous work demonstrated the importance of carefully managing such output to prevent application slowdown due to IO blocking and resource contention that degrade simulation performance, and to fully exploit the IO bandwidth available to the petascale machine. This paper takes a further step in understanding and managing extreme-scale IO. Specifically, its evaluations seek to understand how to efficiently read data for subsequent data analysis, visualization, checkpoint restart after a failure, and other read-intensive operations. Taken together, these actions support the 'end-to-end' needs of scientists, enabling the scientific processes being undertaken. Contributions include the following. First, working with application scientists, we define 'read' benchmarks that capture the common read patterns used by analysis codes. Second, these read patterns are used to evaluate different IO techniques at scale, to understand how alternative data sizes and organizations affect the performance seen by end users. Third, defining the novel notion of a 'data district' to characterize how data is organized for reads, we experimentally compare the read performance of the ADIOS middleware's log-based BP format to that of the logically contiguous NetCDF and HDF5 formats commonly used by analysis tools. Measurements assess the performance seen across patterns and with different data sizes, organizations, and read process counts. Outcomes demonstrate that high end-to-end IO performance requires data organizations that offer flexibility in data layout and placement on parallel storage targets, including in ways that trade off the performance of data writes against that of reads.
KW - adios
KW - analysis
KW - hdf5
KW - log-based
KW - logically contiguous
KW - netcdf
KW - pnetcdf
KW - visualization
UR - http://www.scopus.com/inward/record.url?scp=79960527409&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79960527409&partnerID=8YFLogxK
U2 - 10.1145/1996130.1996139
DO - 10.1145/1996130.1996139
M3 - Conference contribution
AN - SCOPUS:79960527409
SN - 9781450305525
T3 - Proceedings of the IEEE International Symposium on High Performance Distributed Computing
SP - 49
EP - 60
BT - HPDC'11 - Proceedings of the 20th International Symposium on High Performance Distributed Computing
Y2 - 8 June 2011 through 11 June 2011
ER -