Performance Analysis of Multi-Node Hadoop Cluster Based on Large Data Sets

N. Ahmed, Andre L.C. Barczak, Sibghat Ullah Bazai, Teo Susnjak, Mohammed A. Rashid

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution (peer-reviewed)

2 Citations (Scopus)

Abstract

This paper assesses the performance of Hadoop MapReduce and Spark on two clusters, one with 5 slave nodes and another with 9 slave nodes. The experiments use the HiBench WordCount and TeraSort workloads, with data sizes varied from 50 GB to 600 GB. We selected a few configuration parameters and replaced their default values with tuned values, allowing us to analyze the effect of these changes on each job's runtime. The results show that, depending on the tuned parameters, MapReduce and Spark achieved performance improvements of 64% and around 60%, respectively, at each data point for both the WordCount and TeraSort workloads. In addition, we observed a modest further speed-up of about 1% from the extra slave nodes. These results show that cluster performance can be improved by changing the default values of a few parameters and by adding slave nodes.
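
The abstract does not reproduce the exact parameter set that was tuned. As a rough sketch of the kind of experiment described, the canonical Hadoop MapReduce WordCount driver below overrides a few default configuration values programmatically; the parameters and values shown (mapreduce.task.io.sort.mb, mapreduce.map.output.compress, and the reducer count) are real Hadoop settings but are illustrative assumptions here, not the authors' configuration.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as combiner): sums the counts per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative tuning only: the paper does not list its exact parameter
    // set, so these overrides are assumptions, not the authors' values.
    conf.setInt("mapreduce.task.io.sort.mb", 512);           // larger map-side sort buffer
    conf.setBoolean("mapreduce.map.output.compress", true);  // compress intermediate data

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setNumReduceTasks(18);  // e.g. ~2 reducers per slave on a 9-node cluster (assumption)
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

In practice one would re-run such a job over each data size (50 GB to 600 GB) with and without the overrides and compare runtimes, which is the comparison the abstract reports.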

Original language: English
Title of host publication: 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2020
Publisher: IEEE, Institute of Electrical and Electronics Engineers
ISBN (Electronic): 9781665419741
DOIs
Publication status: Published - 16 Dec 2020
Externally published: Yes
Event: 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2020 - Gold Coast, Australia
Duration: 16 Dec 2020 - 18 Dec 2020

Publication series

Name: 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2020

Conference

Conference: 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering, CSDE 2020
Country/Territory: Australia
City: Gold Coast
Period: 16/12/20 - 18/12/20
