TY - JOUR
T1 - An enhanced parallelisation model for performance prediction of Apache Spark on a multinode Hadoop cluster
AU - Ahmed, Nasim
AU - Barczak, Andre L.C.
AU - Rashid, Mohammad A.
AU - Susnjak, Teo
N1 - Funding Information:
Acknowledgments: This work was supported in part by the Massey University Doctoral Scholarship.
Publisher Copyright:
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
PY - 2021/12
Y1 - 2021/12
N2 - Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and configuring so many parameters is a challenging task when determining suitable settings for the system. In this paper, we propose two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allow us to predict the runtime when the size of the job and the number of executables are known. The proposed models were evaluated on five HiBench workloads: Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. Each workload’s empirical data were fitted with one of the two models, meeting the accuracy requirements. Finally, the experimental findings show that the models can be a handy and helpful tool for scheduling and planning system deployment.
AB - Big data frameworks play a vital role in storing, processing, and analysing large datasets. Apache Spark has been established as one of the most popular big data engines for its efficiency and reliability. However, one of the significant problems of the Spark system is performance prediction. Spark has more than 150 configurable parameters, and configuring so many parameters is a challenging task when determining suitable settings for the system. In this paper, we propose two distinct parallelisation models for performance prediction. Our insight is that each node in a Hadoop cluster can communicate with identical nodes, and a certain function of the non-parallelisable runtime can be estimated accordingly. Both models use simple equations that allow us to predict the runtime when the size of the job and the number of executables are known. The proposed models were evaluated on five HiBench workloads: Kmeans, PageRank, Graph (NWeight), SVM, and WordCount. Each workload’s empirical data were fitted with one of the two models, meeting the accuracy requirements. Finally, the experimental findings show that the models can be a handy and helpful tool for scheduling and planning system deployment.
UR - http://www.scopus.com/inward/record.url?scp=85119332827&partnerID=8YFLogxK
U2 - 10.3390/bdcc5040065
DO - 10.3390/bdcc5040065
M3 - Article
AN - SCOPUS:85119332827
SN - 2504-2289
VL - 5
JO - Big Data and Cognitive Computing
JF - Big Data and Cognitive Computing
IS - 4
M1 - 65
ER -