Many large-scale applications in business and scientific domains require both parallel computing and distributed data management for big data processing. One typical scenario is the use of the Spark computing engine to process large amounts of data managed by HBase in Hadoop. Such computing workflows provide an opportunity to optimize application performance through strategic resource allocation with suitable parameter settings. Recommending optimal system configurations to end users therefore requires accurate modeling and prediction of application performance. However, this is a challenging problem for multiple reasons, mainly the large parameter space and the dynamic interactions between different technology layers of big data systems. In this paper, we propose a class of regression-based machine learning models to predict the execution performance of Spark-HBase applications in Hadoop. We first explore and identify an exhaustive set of system parameters across multiple layers, including Spark and HBase, and then conduct an in-depth exploratory analysis of their effects on the execution time of Spark-HBase applications. Based on these analysis results, we design a performance predictor using regression-based machine learning algorithms. Experimental results show that the resulting predictor achieves high accuracy across the different algorithms compared. The proposed approach can facilitate automatic system configuration and has the potential to be applied to other similar systems for big data processing.