A memory and time scalable parallelization of the reptile error-correction code

Vipin Sachdeva, Srinivas Aluru, David A. Bader

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

This paper details a distributed-memory implementation of Reptile, a scalable and accurate spectrum-based error-correction method. Reptile uses information from both k-mers and adjoining k-mers (called tiles), along with the quality scores of bases, to correct substitution errors from next-generation sequencing machines. Previous approaches to parallelizing Reptile have replicated the spectra on each node, which can be prohibitive in terms of the memory needed for huge datasets. Our approach distributes both the k-mer and the tile spectrum amongst the processing ranks, relying on message passing for error correction. This allows hardware with any memory size per node to be employed for error correction using Reptile's algorithm, irrespective of the size of the dataset. As part of our implementation, we have also implemented several heuristics that can be used to run the algorithm optimally based on the advantages of the hardware used. We present our results on IBM's BlueGene/Q architecture for the E. coli, Drosophila and human datasets, showing excellent scalability with an increasing number of nodes. Using 256 nodes of BlueGene/Q, we are able to error-correct the E. coli and Drosophila datasets in less than 200 seconds and 600 seconds, respectively. The human dataset, consisting of 1.55 billion reads, is corrected in a little more than two hours using 1024 nodes of BlueGene/Q. All three datasets are corrected with Reptile's memory-intensive algorithm using less than 512 MB per process.
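
The idea of distributing a k-mer spectrum across ranks, rather than replicating it, can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the 2-bit encoding, the modulo-based owner function, and the names `encode_kmer`, `owner_rank`, `build_distributed_spectrum` and `lookup_count` are all assumptions made for illustration, and the list of per-rank counters only simulates what a real code would do with message passing between processes.

```python
from collections import Counter

# Hypothetical 2-bit encoding of bases; the abstract does not specify one.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmer(kmer):
    """Pack a k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value


def owner_rank(kmer, nranks):
    """Assumed ownership rule: the encoded k-mer modulo the rank count."""
    return encode_kmer(kmer) % nranks


def build_distributed_spectrum(reads, k, nranks):
    """Count k-mers, storing each count only on the rank that owns the k-mer.

    Each Counter stands in for the local hash table of one rank, so no
    rank holds the whole spectrum.
    """
    spectra = [Counter() for _ in range(nranks)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if all(base in ENCODE for base in kmer):
                spectra[owner_rank(kmer, nranks)][kmer] += 1
    return spectra


def lookup_count(spectra, kmer):
    """Simulate a remote query: ask the owning rank for the k-mer count."""
    return spectra[owner_rank(kmer, len(spectra))][kmer]


if __name__ == "__main__":
    reads = ["ACGTACGT", "CGTACGTA"]
    spectra = build_distributed_spectrum(reads, k=4, nranks=3)
    for rank, table in enumerate(spectra):
        print(rank, dict(table))
    print(lookup_count(spectra, "ACGT"))
```

In this sketch the memory held by each rank shrinks as ranks are added, at the cost of a lookup that would be a message to another process in a real distributed run. A tile spectrum could be partitioned with the same ownership rule applied to tiles instead of k-mers.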

Original language: English (US)
Title of host publication: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 453-462
Number of pages: 10
ISBN (Electronic): 9781509021406
State: Published - Jul 18 2016
Externally published: Yes
Event: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016 - Chicago, United States
Duration: May 23 2016 to May 27 2016

Publication series

Name: Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Conference

Conference: 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016
Country/Territory: United States
City: Chicago
Period: 5/23/16 to 5/27/16

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications

Keywords

  • Big data
  • Distributed
  • HPC
  • Hash table
  • K-mer
