A Survey of Machine Learning Techniques for Self-tuning Hadoop Performance

ABSTRACT


INTRODUCTION
The purpose of this paper is threefold: to provide a brief description of the concepts of machine learning, big data and the Hadoop system; to present a systematic analysis of existing techniques in terms of performance, parameters, datasets and system configuration; and to propose a promising technique that uses a deep learning algorithm to improve the performance of the Hadoop system in processing big data.
A roadmap of this paper is given in Figure 1. Section 1 and Section 2 discuss the Hadoop system with MapReduce (MR), the Hadoop Distributed File System (HDFS) and YARN. They are followed by a discussion of big data and its V's, the classification of machine learning and existing machine learning algorithms. Section 2.2 discusses critical issues in the Hadoop system. Section 3 discusses a promising application of deep learning algorithms to improve Hadoop performance. A summary is presented in Section 4.

Hadoop system
The Apache Hadoop framework mainly consists of three components: MR, HDFS and YARN. The role of MR is data processing. The role of HDFS is to manage storage, which it does by breaking a file into multiple blocks and copying each block onto three different servers [1]-[4]. MR is considered a programming process that comprises a JobTracker (which manages tasks) and TaskTrackers (which run tasks). Figure 2 shows the high-level architecture of Hadoop. Hadoop works with two kinds of node, namely the master node and the slave node. The master node hosts a TaskTracker and the JobTracker in the MR layer, and the NameNode and a DataNode in the HDFS layer. A slave node hosts a TaskTracker in the MR layer and a DataNode in the HDFS layer. The TaskTracker of the master node communicates with the JobTracker, and the JobTracker communicates with the TaskTrackers of the slave nodes. The NameNode communicates with all DataNodes.

Hadoop file system (HDFS)
HDFS is designed for reliable and efficient storage of very large datasets [5]. Hadoop scales to a large number of hosts and partitions data across many nodes to perform computations. HDFS stores file system metadata and application data separately [3]. Metadata is stored on a dedicated server called the NameNode, and application data is stored on other servers called DataNodes [6], [7].
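The block-and-replica scheme described above can be illustrated with a minimal sketch. It assumes a 128 MB block size and a simple round-robin placement; the function names and the DataNode labels are hypothetical, and real HDFS placement also takes rack topology into account, which this sketch ignores.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file is broken into.

    The 128 MB block size is an assumed default for illustration.
    """
    n_full, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * n_full
    if remainder:
        sizes.append(remainder)  # last block holds whatever is left over
    return sizes


def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Round-robin is a simplification chosen for this sketch, not the
    placement policy HDFS actually uses.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(300)  # a hypothetical 300 MB file
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

In this toy run the NameNode's role corresponds to holding the `placement` mapping (metadata), while the DataNodes would hold the block contents themselves.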

MapReduce
MR is a programming process that contains two functions, namely map and reduce [6], [8]. In terms of design, MR comprises three main components: programming, storage design and scheduling. In the programming step, the map function receives its input from the user as key-value pairs and passes its output to the reduce function for further processing and generation of the result [8], [3]. The reduce function processes the output of the map function through shuffling, sorting and merging of the data [7], [9].
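The map, shuffle/sort and reduce steps can be sketched in a few lines of single-process Python, using word counting as the example job. This is only an illustration of the data flow; the function names are hypothetical and nothing here uses the Hadoop API or runs in a distributed fashion.

```python
from collections import defaultdict


def map_fn(_key, line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_fn(word, counts):
    """Reduce: merge all values of one key into a single result."""
    return word, sum(counts)


def run_wordcount(lines):
    # The line index plays the role of the input key.
    mapped = [pair for i, line in enumerate(lines) for pair in map_fn(i, line)]
    return dict(reduce_fn(word, counts) for word, counts in shuffle(mapped))


result = run_wordcount(["big data on Hadoop", "Hadoop tunes big jobs"])
```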

MACHINE LEARNING TECHNIQUES AND CRITICAL ISSUES

Classification of machine learning algorithms
Machine Learning (ML) refers to the intelligence of a machine that enables it to make decisions [16], [17]. It has greatly impacted information science in the areas of prediction, classification, image recognition, computer vision, speech processing, natural language understanding, neuroscience, health, and the IoT (Internet of Things). An ML algorithm is required to process information from a variety of data within a limited time, a task that is challenged by the emergence of big data [18], [7]. Machine learning is categorized into supervised, unsupervised, semi-supervised and reinforcement learning [19]. The most popular machine learning algorithms are shown in Figure 4.
a. Supervised Learning: Supervised learning is trained on labelled instances, that is, inputs for which the expected result is known. It works on a dataset that comprises both structure and labels.
b. Unsupervised Learning: Unsupervised learning is applied to data that has no prior labels. Its aim is to explore the data and find similarities among the objects, so it is a technique for discovering labels from the data itself. Unsupervised learning works well on transactional datasets.
c. Semi-supervised Learning: Semi-supervised learning can be used for the same applications as supervised learning, but it learns from labelled and unlabelled data together.
d. Reinforcement Learning: Reinforcement learning is frequently used in navigation, gaming and robotics. In this learning method, the learner interacts with a dynamic environment in which it must achieve a particular goal without a trainer explicitly telling it whether it has come close to that goal [20].

Analysis of critical issues in Hadoop system
The critical issues in the Hadoop system for big data processing are depicted in Figure 5.
a. Random Forest: RFHOC is a model that automatically tunes the configuration parameters to optimize the performance of a specific application running on a particular cluster. RFHOC establishes two models based on the random-forest approach that handle the map and reduce stages in a similar manner. Five Hadoop programs, namely wordcount, terasort, sort, Adjlist and Inverted-Index, are used in the evaluation of RFHOC. The evaluation shows an average speedup of 2.11 times, and a maximum speedup of 7.4 times, compared with the cost-based optimization (CBO) approach [1].
b. Support Vector Regression: Support Vector Regression (SVR) is one of the most popular ML algorithms. The SVR model is considered one of the best ML approaches in terms of accuracy and computational efficiency. The SVR auto-tuning mechanism integrates a machine-learning performance model with an intelligent search algorithm to explore the parameter space effectively and train models efficiently. The performance of the SVR model was measured on two programs, sort and wordcount, and compared with the Starfish model. For the sort program, the SVR model improved performance by 39%, while the Starfish model improved it by 13%. For the wordcount program, however, the improvement of the Starfish model was either similar, at 40%, or slightly better, by 5% [21].
c. Support Vector Machine: In another model, known as Automated Resource Allocation and Configuration of MapReduce Environment in the Cloud (AROMA), the allocation of resources and the configuration of parameters are automated through a two-stage optimization framework and ML. This model focuses on reducing the cost of big data processing in the Hadoop system. The results show that processing 10GB of data with AROMA auto-tuning costs 36 cents using 5 medium VMs (Virtual Machines), whereas processing it without AROMA auto-tuning costs 51 cents using 6 medium VMs. They also show that resource allocation under the AROMA mechanism can cost less than the default allocation. On average, AROMA's cost efficiency is 25% [22].

Int J Elec & Comp Eng
ISSN: 2088-8708

d. K-means++: Unlike the AROMA mechanism, many other mechanisms focus only on improving performance by reducing data processing time. The Profiling and Performance Analysis-based System (PPABS) uses a two-phase framework (Analyzer and Recognizer) in which the Analyzer applies K-means++ clustering to analyze jobs and the Recognizer applies a classification approach to recognize them. The experimental results show that the processing time for big data is reduced for the TeraSort and WordCount programs. In an experiment with a 10 GB input dataset, the average completion time of TeraSort and WordCount decreases by 38.4% and 18.7% respectively. The difference between the two improvements is attributed to the differing characteristics of the jobs [23].
e. Tree-Based Regression: This approach has two phases, a prediction phase and an optimization phase. In the prediction phase, the performance of a MapReduce job is estimated. In the optimization phase, a search for an on-average optimal parameter configuration is made by invoking the predictor repeatedly. It is reported that this approach can help the user improve performance by 2 to 8 times compared with the prediction phase [24].
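The techniques surveyed above share a common pattern: a performance model predicts the runtime of a job for a given configuration, and a search procedure invokes that model repeatedly to find a good configuration. The sketch below illustrates the pattern only. The search space, the parameter names and the runtime formula are all hypothetical, the predictor is a hand-written stand-in rather than a trained random forest, SVR or regression tree, and exhaustive grid search is swapped in for the more sophisticated search strategies used in the cited works.

```python
from itertools import product

# Hypothetical search space: parameter name -> candidate values.
SEARCH_SPACE = {
    "sort_buffer_mb": [50, 100, 200, 400],
    "num_reducers": [1, 2, 4, 8, 16],
    "compress_map_output": [False, True],
}


def predict_runtime(config):
    """Stand-in for a trained performance model.

    A made-up analytic formula (seconds), NOT fitted to any real data.
    """
    runtime = 600.0
    # A larger sort buffer helps, but only up to 200 MB in this toy model.
    runtime -= 0.5 * min(config["sort_buffer_mb"], 200)
    # The toy model is fastest with 8 reducers.
    runtime += 20.0 * abs(config["num_reducers"] - 8) / 8
    if config["compress_map_output"]:
        runtime -= 30.0
    return runtime


def grid_search(predictor, space):
    """Optimization phase: call the predictor on every candidate config."""
    names = list(space)
    best_config, best_runtime = None, float("inf")
    for values in product(*(space[name] for name in names)):
        candidate = dict(zip(names, values))
        runtime = predictor(candidate)
        if runtime < best_runtime:  # keep the first best candidate on ties
            best_config, best_runtime = candidate, runtime
    return best_config, best_runtime


best_config, best_runtime = grid_search(predict_runtime, SEARCH_SPACE)
```

Exhaustive search is feasible here only because the toy space has 40 combinations. With many parameters the space grows multiplicatively, which is why a cheap predictor is invoked in place of running the actual job for each candidate.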

Parameters
The algorithms above tune the parameters listed in Table 1.
Parameters are the main factors that play an important role in improving the performance of the Hadoop system. The limitation of the default Hadoop system is that its parameters are fixed at default values. Among the 30 effective parameters, different models used different parameter configurations. Table 1 shows some of the most effective parameters that were used in optimizing Hadoop performance [25]-[28].
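To make the idea of overriding default values concrete, the sketch below renders a set of chosen parameter values as a Hadoop-style configuration fragment. The four property names are standard MapReduce properties picked as an assumption for illustration; they are not taken from Table 1, and the values are arbitrary rather than recommended settings. The helper name `to_site_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative overrides; names assumed, values arbitrary.
overrides = {
    "mapreduce.task.io.sort.mb": 200,
    "mapreduce.task.io.sort.factor": 20,
    "mapreduce.job.reduces": 8,
    "mapreduce.map.output.compress": True,
}


def to_site_xml(params):
    """Render name/value pairs as a <configuration> XML fragment."""
    root = ET.Element("configuration")
    for name, value in params.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        # Booleans are written in lowercase ("true"/"false").
        text = str(value).lower() if isinstance(value, bool) else str(value)
        ET.SubElement(prop, "value").text = text
    return ET.tostring(root, encoding="unicode")


xml_text = to_site_xml(overrides)
```

A self-tuning system would generate such a fragment from the configuration its search procedure selects, instead of leaving every property at its default.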

System configuration
The ML algorithms above were evaluated on different system configurations, and the different models compared their performance against different systems. For example, the Random-Forest model compared its outcome against the CBO-based approach, with the configuration done in a similar manner. It used 10 Sugan servers equipped with quad-core Intel Xeon E5-2407 CPUs running at 2.20GHz and 32GB of PC3 memory, connected through gigabit Ethernet [1]. The SVR model used the HiBench benchmark for the WordCount and Sort programs, and used two clusters, namely the SandyBridge (SNB) cluster and the ZT cluster. SVR performed the experiment on a server with a dual-core Intel Core i5-2540M processor running at 2.60GHz and 4GB of main memory [21]. On the other hand, the SVM model was implemented on 7 HP ProLiant BL460C G6 blade servers together with an HP EVA storage area network with 10Gbps Ethernet and 8Gbps Fibre/iSCSI dual channels. The small VMs (Virtual Machines) had 1 vCPU, 2GB of RAM and 50GB of hard disk space, and the medium VMs had 2 vCPUs, 4GB of RAM and 80GB of hard disk space [22]. In addition, the K-means++ model used a cluster that contains five DataNodes and one NameNode. The NameNode and the DataNodes run on 2 EC2 Compute Units and 1 EC2 Compute Unit, with 300GB and 200GB of memory, respectively [23]. Finally, the Tree-Based Regression algorithm evaluated its performance on an 8-node Pdefault setting where each node contains eight Intel i7-4770 cores, 32GB of RAM and 2TB of disk space [24].

PROSPECT OF APPLICATION OF DEEP LEARNING ALGORITHM TO HADOOP PERFORMANCE IMPROVEMENT
Hadoop is an integral part of big data processing. Hadoop performance is an impediment to efficient service because its parameters are not self-tuned. Different ML algorithms have been proposed to improve performance by auto-tuning the most effective parameters. The analysis of performance with different ML algorithms shows that self-tuning improves Hadoop system performance in comparison with the default parameter configuration. However, there is a need for a new model to further improve Hadoop system performance with respect to speed and accuracy. Deep learning has been adopted, through the popular Theano [29], TensorFlow [30] and Caffe [31] libraries, in many sectors of big data processing, and it has been found to improve performance. Deep learning algorithms are used for processing big data in many giant tech companies, including Google, Facebook and Amazon. The authors feel there is scope for applying deep learning algorithms to self-tune and improve Hadoop system performance [7].

CONCLUSION
This paper presents a brief review of the concepts of big data, the Hadoop system, self-tuning of Hadoop parameters and ML algorithms. Self-tuning of Hadoop system parameters using ML algorithms has been found to improve performance compared with the default parameter configuration. The authors propose the application of deep learning algorithms to self-tuning in the Hadoop system as a prospect for improving the speed and accuracy of performance.