XML2HBase: Storing and querying large collections of XML documents using a NoSQL database system

Liang Bao, Jin Yang, Chase Q. Wu, Haiyang Qi, Xin Zhang, Shunda Cai

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Many big data applications, such as smart transportation, healthcare, and e-commerce, need to store and query large collections of small XML documents, and doing so efficiently has become a fundamental problem. However, existing solutions fail to deliver satisfactory query performance in such circumstances. In this paper, we propose a framework named XML2HBase to address this problem using HBase, a widely deployed NoSQL database. Within this framework, we design a novel encoding scheme called Pathed-Dewey Order and a two-layer mapping method to store XML documents in HBase tables. XML queries, which are represented as XPath expressions, are evaluated by translating them into queries over HBase tables. Based on an in-depth analysis of the characteristics of the proposed approach, we design and integrate four optimization strategies to reduce storage space and query response time. Extensive experiments on two well-known XML benchmarks demonstrate the superior performance of XML2HBase over three state-of-the-art methods.

Original language: English (US)
Pages (from-to): 83-99
Number of pages: 17
Journal: Journal of Parallel and Distributed Computing
Volume: 161
State: Published - Mar 2022

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
  • Artificial Intelligence

Keywords

  • NoSQL database
  • XML data mapping
  • XML encoding scheme
  • XML query processing and optimization
