A Hadoop Performance Prediction Model Based on Random Forest

Release Date:2013-07-24 Author:Zhendong Bei, Zhibin Yu, Huiling Zhang, Chengzhong Xu, Shenzhong Feng, Zhenjiang Dong, and Hengsheng Zhang Click:

[Abstract] MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tunned. This is not only time-consuming but also error-pron. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, is used to predict the performance of a Hadoop system according to the system’s configuration parameters. RFMS is created from 2000 distinct fine-grained performance observations with different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that the prediction accuracy of RFMS achieves 95% on average and up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.

[Keywords] big data; cloud computing; MapReduce; Hadoop; random forest; micro-benchmark