TY - GEN
T1 - On distributed information composition in big data systems
AU - Alquwaiee, Haifa
AU - He, Songlin
AU - Wu, Chase
AU - Tang, Qiang
AU - Shen, Xuewen
N1 - Publisher Copyright: © 2019 IEEE.
PY - 2019/9
Y1 - 2019/9
N2 - Modern big data computing systems exemplified by Hadoop employ parallel processing based on distributed storage. The results produced by parallel tasks such as computing modules in scientific workflows or reducers in the MapReduce framework are typically stored in a distributed file system across multiple data nodes. However, most existing systems do not provide a mechanism to compose such distributed information, as required by many big data applications. We construct analytical cost models and formulate a Distributed Information Composition problem in Big Data Systems, referred to as DIC-BDS, to aggregate multiple datasets stored as data blocks in Hadoop Distributed File System (HDFS) using a composition operator of specific complexity to produce one final output. We rigorously prove that DIC-BDS is NP-complete, and propose two heuristic algorithms: Fixed-window Distributed Composition Scheme (FDCS) and Dynamic-window Distributed Composition Scheme with Delay (DDCS-D). We conduct extensive experiments in Google clouds with various composition operators of commonly considered degrees of complexity including O(n), O(n log n), and O(n^2). Experimental results illustrate the performance superiority of the proposed solutions over existing methods. Specifically, FDCS outperforms all other algorithms in comparison with a composition operator of complexity O(n) or O(n log n), while DDCS-D achieves the minimum total composition time with a composition operator of complexity O(n^2). These algorithms provide an additional level of data processing for efficient information aggregation in existing workflow and big data systems.
AB - Modern big data computing systems exemplified by Hadoop employ parallel processing based on distributed storage. The results produced by parallel tasks such as computing modules in scientific workflows or reducers in the MapReduce framework are typically stored in a distributed file system across multiple data nodes. However, most existing systems do not provide a mechanism to compose such distributed information, as required by many big data applications. We construct analytical cost models and formulate a Distributed Information Composition problem in Big Data Systems, referred to as DIC-BDS, to aggregate multiple datasets stored as data blocks in Hadoop Distributed File System (HDFS) using a composition operator of specific complexity to produce one final output. We rigorously prove that DIC-BDS is NP-complete, and propose two heuristic algorithms: Fixed-window Distributed Composition Scheme (FDCS) and Dynamic-window Distributed Composition Scheme with Delay (DDCS-D). We conduct extensive experiments in Google clouds with various composition operators of commonly considered degrees of complexity including O(n), O(n log n), and O(n^2). Experimental results illustrate the performance superiority of the proposed solutions over existing methods. Specifically, FDCS outperforms all other algorithms in comparison with a composition operator of complexity O(n) or O(n log n), while DDCS-D achieves the minimum total composition time with a composition operator of complexity O(n^2). These algorithms provide an additional level of data processing for efficient information aggregation in existing workflow and big data systems.
KW - Big data
KW - Distributed algorithms
KW - Information composition
KW - Task scheduling
UR - http://www.scopus.com/inward/record.url?scp=85083267334&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85083267334&partnerID=8YFLogxK
U2 - 10.1109/eScience.2019.00025
DO - 10.1109/eScience.2019.00025
M3 - Conference contribution
AN - SCOPUS:85083267334
T3 - Proceedings - IEEE 15th International Conference on eScience, eScience 2019
SP - 168
EP - 177
BT - Proceedings - IEEE 15th International Conference on eScience, eScience 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 15th IEEE International Conference on eScience, eScience 2019
Y2 - 24 September 2019 through 27 September 2019
ER -