PreDatA - Preparatory data analytics on peta-scale machines

Fang Zheng, Hasan Abbasi, Ciprian Docan, Jay Lofstead, Qing Liu, Scott Klasky, Manish Parashar, Norbert Podhorszki, Karsten Schwan, Matthew Wolf

Research output: Chapter in Book/Report/Conference proceedingConference contribution

130 Scopus citations

Abstract

Peta-scale scientific applications running on High End Computing (HEC) platforms can generate large volumes of data. For high performance storage and in order to be useful to science end users, such data must be organized in its layout, indexed, sorted, and otherwise manipulated for subsequent data presentation, visualization, and detailed analysis. In addition, scientists desire to gain insights into selected data characteristics 'hidden' or 'latent' in these massive datasets while data is being produced by simulations. PreDatA, short for Preparatory Data Analytics, is an approach to preparing and characterizing data while it is being produced by the large scale simulations running on peta-scale machines. By dedicating additional compute nodes on the machine as 'staging' nodes and by staging simulations' output data through these nodes, PreDatA can exploit their computational power to perform select data manipulations with lower latency than attainable by first moving data into file systems and storage. Such in- transit manipulations are supported by the PreDatA middleware through asynchronous data movement to reduce write latency, application-specific operations on streaming data that are able to discover latent data characteristics, and appropriate data reorganization and metadata annotation to speed up subsequent data access. PreDatA enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and inspection, as well as for data exchange between concurrently running simulations.

Original languageEnglish (US)
Title of host publicationProceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
DOIs
StatePublished - 2010
Externally publishedYes
Event24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010 - Atlanta, GA, United States
Duration: Apr 19 2010Apr 23 2010

Publication series

NameProceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

Other

Other24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010
Country/TerritoryUnited States
CityAtlanta, GA
Period4/19/104/23/10

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Software
  • Theoretical Computer Science

Fingerprint

Dive into the research topics of 'PreDatA - Preparatory data analytics on peta-scale machines'. Together they form a unique fingerprint.

Cite this