Canopus: Enabling extreme-scale data analytics on big HPC storage via progressive refactoring

Tao Lu, Eric Suchyta, Jong Choi, Norbert Podhorszki, Scott Klasky, Qing Liu, Dave Pugmire, Matthew Wolf, Mark Ainsworth

Research output: Contribution to conference › Paper › peer-review


Abstract

High-accuracy scientific simulations on high performance computing (HPC) platforms generate large amounts of data. To allow this data to be analyzed efficiently, simulation outputs need to be refactored, compressed, and properly mapped onto storage tiers. This paper presents Canopus, a progressive data management framework for storing and analyzing big scientific data. Canopus refactors simulation results, at fairly low overhead, into a much smaller base dataset plus a series of deltas. The refactored data are then compressed, mapped, and written onto storage tiers. For data analytics, the refactored pieces are selectively retrieved to restore the data at the level of accuracy an analysis requires, letting end users trade off analysis speed against accuracy on the fly. Canopus is demonstrated and thoroughly evaluated using blob detection on fusion simulation data.
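The base-plus-deltas idea in the abstract can be illustrated with a short sketch. The code below is a minimal stand-in, not the paper's actual refactoring algorithm: it assumes a 1-D NumPy array subsampled by 2x per level, and the names refactor and restore are invented here for illustration. It shows the trade-off the abstract describes: restoring with fewer deltas is cheaper but coarser, while applying all deltas recovers the data exactly.

```python
import numpy as np

def refactor(data, levels=2):
    """Split full-accuracy data into a coarse base plus per-level deltas.

    Toy stand-in for Canopus's refactoring step: each level keeps every
    other sample as the coarser representation and stores the dropped
    samples as that level's delta, so restoration can be exact.
    """
    assert data.size % (2 ** levels) == 0, "toy example needs even splits"
    base, deltas = data, []
    for _ in range(levels):
        deltas.append(base[1::2].copy())  # samples the coarse level drops
        base = base[::2].copy()           # coarser representation
    return base, deltas

def restore(base, deltas, level):
    """Rebuild data using the `level` finest deltas (0 = base only).

    Fetching fewer deltas trades accuracy for retrieval speed, which is
    the progressive-analytics trade-off the abstract describes.
    """
    out = base
    for delta in reversed(deltas[len(deltas) - level:]):
        merged = np.empty(out.size + delta.size, dtype=out.dtype)
        merged[::2], merged[1::2] = out, delta  # re-interleave samples
        out = merged
    return out

# Usage: the coarse view is fast but approximate; all deltas restore exactly.
data = np.sin(np.linspace(0.0, 4.0 * np.pi, 16))
base, deltas = refactor(data, levels=2)
quick = restore(base, deltas, level=0)   # base only: 4 of 16 samples
exact = restore(base, deltas, level=2)   # base + all deltas
assert np.array_equal(exact, data)
```

In the real system each delta would live on a different storage tier, so choosing a restore level also chooses which tiers are read.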

Original language: English (US)
State: Published - 2017
Event: 9th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2017, co-located with USENIX ATC 2017 - Santa Clara, United States
Duration: Jul 10 2017 – Jul 11 2017

Conference

Conference: 9th USENIX Workshop on Hot Topics in Storage and File Systems, HotStorage 2017, co-located with USENIX ATC 2017
Country/Territory: United States
City: Santa Clara
Period: 7/10/17 – 7/11/17

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
  • Information Systems
  • Software
  • Computer Networks and Communications
