On a Small File Merger for Fast Access and Modifiability of Small Files in HDFS

Dingchao Chen, Chase Q. Wu, Wei Shen, Yu Zhang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Hadoop Distributed File System (HDFS) was originally designed to store big files and has been widely used in the big-data ecosystem. However, it may suffer from serious performance issues when handling a large number of small files. In this paper, we propose a novel archive system, referred to as Small File Merger (SFM), to solve small file problems in HDFS. The key idea is to combine small files into large ones and build an index for accessing the original files. Unlike traditional archive systems such as Hadoop Archives (HAR), SFM allows archived files to be modified directly without re-archiving. Considering that most reads in HDFS are sequential, we design an adaptive readahead strategy based on the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm to maximize read performance. Furthermore, our system provides an HDFS-compatible interface, which can be used directly without recompiling and redeploying the existing HDFS cluster, hence facilitating convenient deployment for practical use. Preliminary experimental results show that our system achieves better performance than existing methods.
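
The key idea named in the abstract, combining small files into a large one and building an index for access to the originals, can be illustrated with a minimal sketch. This is not SFM's actual layout or API: the function names `merge_small_files` and `read_file`, the `(offset, length)` index format, and the use of in-memory bytes instead of HDFS blocks are all assumptions made for illustration only.

```python
import io
from typing import Dict, Tuple

# Hypothetical index format: file name -> (byte offset, byte length) in the merged blob.
Index = Dict[str, Tuple[int, int]]


def merge_small_files(files: Dict[str, bytes]) -> Tuple[bytes, Index]:
    """Concatenate small files into one blob and record where each one starts."""
    blob = io.BytesIO()
    index: Index = {}
    for name, data in files.items():
        # The current write position is the offset of this file in the blob.
        index[name] = (blob.tell(), len(data))
        blob.write(data)
    return blob.getvalue(), index


def read_file(blob: bytes, index: Index, name: str) -> bytes:
    """Recover one original file from the merged blob using the index."""
    offset, length = index[name]
    return blob[offset:offset + length]


if __name__ == "__main__":
    small = {"a.txt": b"hello", "b.txt": b"", "c.txt": b"world!"}
    merged, idx = merge_small_files(small)
    print(idx)
    print(read_file(merged, idx, "c.txt"))
```

With a single merged object, a file system only has to track one large file plus the index rather than one metadata entry per small file, which is the motivation the abstract gives for merging. How SFM supports in-place modification and adaptive readahead is not specified in this record and is not modeled here.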

Original language: English (US)
Title of host publication: 2021 IEEE/ACS 18th International Conference on Computer Systems and Applications, AICCSA 2021 - Proceedings
Publisher: IEEE Computer Society
ISBN (Electronic): 9781665409698
State: Published - 2021
Externally published: Yes
Event: 18th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2021 - Virtual, Online, Morocco
Duration: Nov 30 2021 to Dec 3 2021

Publication series

Name: Proceedings of IEEE/ACS International Conference on Computer Systems and Applications, AICCSA
Volume: 2021-December
ISSN (Print): 2161-5322
ISSN (Electronic): 2161-5330

Conference

Conference: 18th IEEE/ACS International Conference on Computer Systems and Applications, AICCSA 2021
Country/Territory: Morocco
City: Virtual, Online
Period: 11/30/21 to 12/3/21

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Signal Processing
  • Control and Systems Engineering
  • Electrical and Electronic Engineering

Keywords

  • Adaptive Readahead
  • Archive System
  • Big Data
  • HDFS
  • Small File Problems
  • Stochastic Approximation
