TY - JOUR
T1 - XML2HBase
T2 - Storing and querying large collections of XML documents using a NoSQL database system
AU - Bao, Liang
AU - Yang, Jin
AU - Wu, Chase Q.
AU - Qi, Haiyang
AU - Zhang, Xin
AU - Cai, Shunda
N1 - Funding Information:
Liang Bao is currently a Professor in the School of Computer Science and Technology at Xidian University. His research interests include big data, cloud computing, and software engineering. His research in computing develops machine-learning-based solutions to predict the execution and optimize the performance of big data analytics frameworks in public cloud computing environments. His research in cloud computing and big data develops a big data analytics framework that helps users find implicit patterns or rules in large collections of IoT datasets. Dr. Bao's work has been supported by various funding agencies, including the National Science Foundation of China, the Ministry of Industry and Information of China, the Ministry of Science and Technology of China, and many companies and research institutes. He has published over 20 research articles in highly reputed conference proceedings and journals.
Funding Information:
This work is supported by the National Natural Science Foundation of China under Grant No. 62172316, the Ministry of Education Humanities and Social Science Project of China (Grant No. 17YJA790047), and the Soft Science Research Plans of Shaanxi Province (Grant No. 2020KRZ018). This work is also supported by the Research Project on Major Theoretical and Practical Problems of Philosophy and Social Sciences in Shaanxi Province under Grant No. 20JZ-25, the Key R&D Program of Shaanxi under Grant No. 2019ZDLGY13-03-02, the Natural Science Foundation of Shaanxi Province under Grant No. 2019JM-368, and the Key R&D Program of Hebei under Grant No. 20310102D.
Publisher Copyright:
© 2021 Elsevier Inc.
PY - 2022/3
Y1 - 2022/3
N2 - Many big data applications such as smart transportation, healthcare, and e-commerce need to store and query large collections of small XML documents, which has become a fundamental problem. However, existing solutions are inadequate to deliver satisfactory query performance in such circumstances. In this paper, we propose a framework named XML2HBase to address this problem using HBase, a widely deployed NoSQL database. Within this framework, we design a novel encoding scheme called Pathed-Dewey Order and a two-layer mapping method to store XML documents in HBase tables. XML queries, which are represented as XPath expressions, are evaluated through their translation into queries over HBase tables. Based on an in-depth analysis of the characteristics of the proposed approach, we design and integrate four optimization strategies to reduce storage space and query response time. Extensive experiments on two well-known XML benchmarks demonstrate the superior performance of XML2HBase over three state-of-the-art methods.
KW - NoSQL database
KW - XML data mapping
KW - XML encoding scheme
KW - XML query processing and optimization
UR - http://www.scopus.com/inward/record.url?scp=85120451174&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85120451174&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2021.11.003
DO - 10.1016/j.jpdc.2021.11.003
M3 - Article
AN - SCOPUS:85120451174
SN - 0743-7315
VL - 161
SP - 83
EP - 99
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
ER -