TY - GEN
T1 - Performance Prediction of Big Data Transfer Through Experimental Analysis and Machine Learning
AU - Yun, Daqing
AU - Liu, Wuji
AU - Wu, Chase Q.
AU - Rao, Nageswara S.V.
AU - Kettimuthu, Rajkumar
Funding Information:
This research is sponsored by Harrisburg University and the U.S. National Science Foundation under Grant No. CNS-1828123 with New Jersey Institute of Technology.
Publisher Copyright:
© 2020 IFIP.
PY - 2020/6
Y1 - 2020/6
N2 - Big data transfer in next-generation scientific applications is now commonly carried out over connections with guaranteed bandwidth provisioned in High-performance Networks (HPNs) through advance bandwidth reservation. To use HPN resources efficiently, provisioning agents need to carefully schedule data transfer requests and allocate appropriate bandwidths. Such reserved bandwidths, if not fully utilized by the requesting user, could be simply wasted or cause extra overhead and complexity in management due to exclusive access. This calls for the capability of performance prediction to reserve bandwidth resources that match actual needs. Towards this goal, we employ machine learning algorithms to predict big data transfer performance based on extensive performance measurements, which are collected over a span of several years from a large number of data transfer tests using different protocols and toolkits between various end sites on several real-life physical or emulated HPN testbeds. We first identify a comprehensive list of attributes involved in a typical big data transfer process, including end host system configurations, network connection properties, and control parameters of data transfer methods. We then conduct an in-depth exploratory analysis of their impacts on application-level throughput, which provides insights into big data transfer performance and motivates the use of machine learning. We also investigate the applicability of machine learning algorithms and derive their general performance bounds for performance prediction of big data transfer in HPNs. Experimental results show that, with appropriate data preprocessing, the proposed machine learning-based approach achieves 95% or higher prediction accuracy in up to 90% of the cases with very noisy real-life performance measurements.
AB - Big data transfer in next-generation scientific applications is now commonly carried out over connections with guaranteed bandwidth provisioned in High-performance Networks (HPNs) through advance bandwidth reservation. To use HPN resources efficiently, provisioning agents need to carefully schedule data transfer requests and allocate appropriate bandwidths. Such reserved bandwidths, if not fully utilized by the requesting user, could be simply wasted or cause extra overhead and complexity in management due to exclusive access. This calls for the capability of performance prediction to reserve bandwidth resources that match actual needs. Towards this goal, we employ machine learning algorithms to predict big data transfer performance based on extensive performance measurements, which are collected over a span of several years from a large number of data transfer tests using different protocols and toolkits between various end sites on several real-life physical or emulated HPN testbeds. We first identify a comprehensive list of attributes involved in a typical big data transfer process, including end host system configurations, network connection properties, and control parameters of data transfer methods. We then conduct an in-depth exploratory analysis of their impacts on application-level throughput, which provides insights into big data transfer performance and motivates the use of machine learning. We also investigate the applicability of machine learning algorithms and derive their general performance bounds for performance prediction of big data transfer in HPNs. Experimental results show that, with appropriate data preprocessing, the proposed machine learning-based approach achieves 95% or higher prediction accuracy in up to 90% of the cases with very noisy real-life performance measurements.
KW - Performance prediction
KW - big data transfer
KW - experimental analysis
KW - machine learning
UR - http://www.scopus.com/inward/record.url?scp=85090047916&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090047916&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85090047916
T3 - IFIP Networking 2020 Conference and Workshops, Networking 2020
SP - 181
EP - 189
BT - IFIP Networking 2020 Conference and Workshops, Networking 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IFIP Networking Conference and Workshops, Networking 2020
Y2 - 22 June 2020 through 25 June 2020
ER -