A Hadoop Performance Prediction Model Based on Random Forest

Release Date: 2013-07-24 Authors: Zhendong Bei, Zhibin Yu, Huiling Zhang, Chengzhong Xu, Shenzhong Feng, Zhenjiang Dong, and Hengsheng Zhang

[Abstract] MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tuned. This is not only time-consuming but also error-prone. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, predicts the performance of a Hadoop system from the system's configuration parameters. RFMS is built from 2000 distinct fine-grained performance observations collected under different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that RFMS achieves a prediction accuracy of 95% on average and up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.
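The general idea of predicting performance from configuration parameters with a random forest can be sketched as follows. This is a minimal illustration and not the authors' RFMS implementation. It assumes a hypothetical search space over three Hadoop 1.x parameters (io.sort.mb, mapred.reduce.tasks, io.sort.factor), a made-up runtime function standing in for measured job times, and an accuracy measure defined here as one minus the mean relative error, which may differ from the definition used in the paper.

```python
import random

# Hypothetical search space; the value grids are illustrative assumptions.
SPACE = {
    "io.sort.mb": [50, 100, 150, 200, 250],
    "mapred.reduce.tasks": [1, 2, 4, 8, 16],
    "io.sort.factor": [10, 25, 50, 75, 100],
}
NAMES = list(SPACE)

def fake_runtime(cfg, rng):
    """Made-up stand-in for a measured job time in seconds (not real Hadoop data)."""
    sort_mb, reducers, factor = cfg
    base = 100 + 2000 / reducers + 5000 / sort_mb + 0.5 * abs(factor - 50)
    return base * rng.uniform(0.98, 1.02)  # 2% synthetic measurement noise

def sse(ys):
    """Sum of squared errors around the mean, used as the split cost."""
    m = sum(ys) / len(ys)
    return sum((v - m) ** 2 for v in ys)

def grow(X, y, rng, n_try=2, depth=0, max_depth=8):
    """Grow one regression tree; each split considers n_try random features."""
    if depth >= max_depth or len(y) < 4 or len(set(y)) == 1:
        return sum(y) / len(y)  # leaf: mean runtime of the samples that reach it
    best = None
    for f in rng.sample(range(len(X[0])), n_try):
        # Skip the smallest value so both sides of every split are non-empty.
        for t in sorted({row[f] for row in X})[1:]:
            left = [i for i, row in enumerate(X) if row[f] < t]
            right = [i for i, row in enumerate(X) if row[f] >= t]
            cost = sse([y[i] for i in left]) + sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, f, t, left, right)
    if best is None:  # the sampled features were all constant in this node
        return sum(y) / len(y)
    _, f, t, left, right = best
    kids = [grow([X[i] for i in part], [y[i] for i in part],
                 rng, n_try, depth + 1, max_depth) for part in (left, right)]
    return (f, t, kids[0], kids[1])

def train_forest(X, y, n_trees, rng):
    """Train each tree on a bootstrap sample (drawn with replacement)."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(grow([X[i] for i in idx], [y[i] for i in idx], rng))
    return forest

def predict(forest, row):
    """Average the per-tree predictions for one configuration."""
    total = 0.0
    for node in forest:
        while isinstance(node, tuple):
            f, t, lo, hi = node
            node = lo if row[f] < t else hi
        total += node
    return total / len(forest)

rng = random.Random(0)
sample_cfg = lambda: [rng.choice(SPACE[n]) for n in NAMES]

train_X = [sample_cfg() for _ in range(300)]
train_y = [fake_runtime(c, rng) for c in train_X]
forest = train_forest(train_X, train_y, n_trees=30, rng=rng)

test_X = [sample_cfg() for _ in range(50)]
test_y = [fake_runtime(c, rng) for c in test_X]
errors = [abs(predict(forest, c) - t) / t for c, t in zip(test_X, test_y)]
accuracy = 1 - sum(errors) / len(errors)  # assumed accuracy definition
print(f"mean prediction accuracy on synthetic data: {accuracy:.1%}")
```

Once such a model is trained, candidate configurations can be scored with predict instead of being run on the cluster, which is what makes automatic tuning practical. In a real setting the synthetic runtime function would be replaced by measured benchmark times.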

[Keywords] big data; cloud computing; MapReduce; Hadoop; random forest; micro-benchmark

Related Articles

Content Centric Networking: A New Approach to Big Data Distribution

Big-Data Analytics: Challenges, Key Technologies and Prospects

Data Security and Privacy in Cloud Storage

An Efficient Dynamic Proof of Retrievability Scheme

SPBD: Streamlining Big-Data Processing in Cloud Environments