A smart method for Spark using a neural network for big data

ABSTRACT

Apache Spark has more than 180 configuration parameters, and tuning them manually for each application is tedious and requires in-depth knowledge of the framework. This paper presents SSNNB, a self-tuning approach that uses an artificial neural network to predict optimum values for five predominant Spark parameters from the dataset size and execution time. The model was trained on 3,000 samples collected by running Spark jobs with varying parameter values and dataset sizes, reaching 97.1% training and 96.7% testing accuracy. On a Dell PowerEdge R720 testbed with input datasets from 5 GB to 50 GB, SSNNB achieved an average speedup of about 30% over the default configuration.

INTRODUCTION
Around the world, the number of online users is increasing rapidly with the advancement of social communication and e-commerce. Moreover, many users are constantly storing their content for future use. According to the International Data Corporation (IDC), the digital universe was projected to exceed 44 zettabytes (ZB) in volume by 2020 [1][2][3]. In the era of digital data, big data cannot be overlooked, and different industries and governments have recently placed emphasis on big data technologies. Since conventional computing techniques could not provide the expected results and efficiency for managing big data, distributed frameworks such as Hadoop [4], Spark [5], and Storm [6] have been introduced to satisfy the prerequisites of handling big data.
Apache Spark is one of the most notable and widely used frameworks because of its high performance and flexibility [7]. Apache Spark has over 180 parameters with default values. Appropriate parameter values can be selected manually by the user while processing different sizes and types of data, but performance becomes unsatisfactory when parameter values are selected inappropriately. Therefore, additional parameter tuning is required for each particular application [8]. Manual tuning of the parameters in the Spark framework requires appropriate knowledge, and it is very tedious due to the complex interactions between the parameters.
In current practice, parameter tuning in big data systems is performed in two ways. The first is manual tuning of the parameters by trial and error. This process is complicated and time-consuming, and it demands in-depth knowledge because of the large number of parameters and their internal correlations with each other. To address the manual tuning problem, the authors of [9] proposed a cost-based model for the Hadoop system; however, the model must be maintained by users according to different policies. The second is self-tuning of the parameters when required. This paper proposes an approach based on a neural network to minimize the drawbacks of manual tuning. The research developed a self-tuning approach that adjusts the parameter range based on a neural network model. This approach has three key advantages over existing approaches. Firstly, all tasks are processed by the neural network model. Secondly, all types of datasets, consisting of structured, semi-structured, and unstructured data, can be processed. Thirdly, any volume of data can be processed.
The training data were collected for the five selected parameters by changing the parameter ranges and varying the input datasets. The training process is performed only once to train the machine learning model, which can then predict the numerical values of the selected parameters. The method has been implemented on a testbed that uses a Dell PowerEdge R720 server hosting the Spark framework and running the Spark nodes. The test results show that the proposed method can perform effective self-tuning based on the neural network model, achieving maximum resource usage and saving processing time. The key contributions of the method are as follows:
- An artificial neural network is implemented in the approach, which processes Spark jobs through its application service based on the neural network model. Hence, users do not require in-depth knowledge of the internal system functions and can save time by avoiding manual tuning.
- The self-tuning facility of the approach integrates parameter range allocation. It helps to meet task deadlines and improves the overall performance of Spark.
- In our evaluation using Spark workloads with five different input datasets, the approach achieved an average performance speedup of about 30%.
The remainder of the paper is organized as follows. Section 2 presents the background of the study. Section 3 discusses the related work. Section 4 presents the details of the artificial neural network. Section 5 presents the architecture of SSNNB. The methodology is presented in section 6. Section 7 presents the results and analysis. Finally, conclusions and future work are presented in section 8.

BACKGROUND OF THE STUDY

2.1. Spark
In the area of big data, Apache Spark is the most widely accepted open-source platform that supports the idea of resilient distributed datasets (RDDs). RDDs allow rapid processing of massive volumes of data by leveraging distributed memory. In-memory data operation is appropriate for repetitive applications such as graph algorithms and iterative machine learning. The RDD is considered the main feature of Spark. It characterizes a read-only collection of entities allocated among several machines. An RDD can be explicitly stored in cache memory by the user across several machines and reused in multiple MapReduce-like parallel operations. An RDD achieves fault tolerance through a notion of lineage: whenever a partition of an RDD is lost, it can be rebuilt, since the RDD carries sufficient information about its origin. Although RDDs are not a shared-memory construction, they represent a sweet spot between expressivity on the one hand and reliability and scalability on the other. RDDs are well suited to a diversity of applications.

Figure 1 presents the Spark cluster framework [10]. A Spark cluster comprises a driver node, which is equivalent to a master node, and several worker nodes, which correspond to slave nodes. The driver node manages all worker nodes through the worker node process, and the worker nodes communicate with the driver node through the worker node process and manage local executors. Each application consists of one driver and multiple executors, and all the jobs in an application run on the same executors. The SparkContext is created by the main job of the application, which is run by the driver process. Each worker node launches one or more executor backend processes, and each executor backend manages a single executor instance. An executor manages a thread group that runs each task as a single thread.

Nevertheless, the execution time of a specific task on the Apache Spark platform depends on various factors, such as input data volume, data type, CPU speed, memory size, number of nodes, configuration parameters, and the design and implementation of the system. Based on these factors, the execution time of a specific job in Apache Spark may differ conspicuously [11]. There are more than 180 configuration parameters in Apache Spark that users can tune according to the needs of a specific application to enhance performance. Tuning is the simplest and most effective approach to enhancing performance, but at present the parameters are tuned manually by experimentation [12], which is not effective: it involves complicated interactions between the parameters and a large parameter space, and the parameters must be re-tuned for different applications and clusters.
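To make the caching idea concrete, the following is a minimal PySpark sketch, not taken from the paper, in which an RDD is explicitly cached in memory and then reused by two parallel actions; the application name and input path are hypothetical.

```python
# Minimal PySpark sketch (not from the paper) of explicit RDD caching:
# the cached RDD is computed once and reused by two parallel actions.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-cache-demo")  # hypothetical application name

lines = sc.textFile("hdfs:///data/input.txt")  # hypothetical input path
words = lines.flatMap(lambda line: line.split()).cache()  # keep in memory

total = words.count()              # first action materializes and caches
unique = words.distinct().count()  # second action reuses the cached RDD

print(total, unique)
sc.stop()
```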
2.2. Artificial neural network (ANN)
Artificial neural networks (ANNs) are a mathematical processing method that can be used for both classification and regression [13,14]. The neurons make ANNs a powerful learning model for regression analysis, and they are the best choice for problems with multiple inputs and outputs [15,16]. A neural network can predict numerical values accurately and can easily prevent overfitting. ANNs are well suited to several areas, including natural language and image processing, prediction, and emotion recognition [17][18][19].

RELATED WORK
In recent years, one of the most active research areas has been the optimization of the performance of big data systems. However, almost all existing research has been carried out on the Hadoop platform or the MapReduce computing framework. Starfish [9] utilizes simulation and a cost-based model to seek the required job configuration for a MapReduce workload. AROMA [20] uses an optimization framework and two-phase machine learning to automate resource allocation and job configuration in heterogeneous clouds. The authors of [21] showed that the Hadoop scheduler suffers performance degradation in heterogeneous environments and proposed another scheduler named longest approximate time to end (LATE). A different work [22] concentrated on examining the resource consumption effects of different settings for the Map and Reduce slots. These problems were addressed in [23] through a framework called profiling and performance-based system (PPABS), which can automatically tune the Hadoop configuration settings by deducing the application performance requirements. Modifying the popular KMeans++ clustering together with the simulated annealing algorithm, both adjusted to the MapReduce paradigm, is the main contribution of [24]. Reference [23] recommends easing this issue with an engine that suggests configurations for a new analytical job in a timely and intelligent manner. This engine embeds an adapted k-nearest neighbor (KNN) algorithm to discover an appropriate configuration based on past jobs that executed well. However, research on optimizing Apache Spark performance is still at an early stage. The authors of [24] present a simulation-driven prediction model to anticipate the performance of a job with high accuracy for Apache Spark. Their proposed model can predict the memory usage and execution time of Spark systems under the default parameters. The authors of [25] showed that the support vector regression (SVR) model is computationally efficient with high accuracy. According to their findings, the auto-tuning method can offer comparable or better performance than Starfish with fewer parameters.

ARTIFICIAL NEURAL NETWORK (ANN)
Scikit-learn is an essential tool, since it requires only a few lines of code and supports the prevalent data preparation steps. To proceed with the evaluation, the Keras wrappers need to be provided with a function that creates the ANN. This function builds the base model that is the subject of evaluation. The base model contains a hidden layer with three neurons, as illustrated in Figure 2. The hidden and output layers are activated with the ReLU and softmax activation functions, respectively. Furthermore, the efficient "Adam" optimizer can be used to update the network weights iteratively based on the training data. The object in the Keras wrapper known as KerasRegressor is used as a regression estimator in scikit-learn. The ANN function is then created by passing parameters, including the batch size and number of epochs, along with the model function, both of which are set to their defaults. Furthermore, a random number generator with a constant random seed is initialized so that models can be compared consistently; in this research, the random number generator is re-seeded for the evaluation of each model.

A neuron takes inputs, does some math with them, and produces an output. A simple neuron looks like what is shown in Figure 3. Three things are happening here. First, each input is multiplied by a weight. Next, all the weighted inputs are added together with a bias b: (x1 * w1) + (x2 * w2) + ... + (xn * wn) + b. Finally, the sum is passed through an activation function: y = f(x1 * w1 + x2 * w2 + ... + xn * wn + b).
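The following is a minimal sketch of the base model just described, under stated assumptions: the two-dimensional input (dataset size and execution time) is inferred from Section 5, the epoch count of 250 is taken from Section 6, the batch size is a placeholder, and the softmax output follows the paper's description even though a linear output is more conventional for a single-value regressor.

```python
# Minimal sketch of the base model; input dimension, epochs, and batch size
# are assumptions as noted in the text above, not values from the paper's code.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor

np.random.seed(7)  # constant random seed so model evaluations are comparable

def create_base_model():
    model = Sequential()
    # Hidden layer of three neurons with ReLU activation (Figure 2).
    model.add(Dense(3, input_dim=2, activation="relu"))
    # The paper specifies softmax on the output layer; for a single-output
    # regressor, a linear activation would be the more conventional choice.
    model.add(Dense(1, activation="softmax"))
    # "Adam" updates the network weights iteratively from the training data.
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model

# KerasRegressor makes the model usable as a scikit-learn regression estimator.
estimator = KerasRegressor(build_fn=create_base_model,
                           epochs=250, batch_size=10, verbose=0)
```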

Activation functions ReLU and softmax
The rectified linear unit (ReLU) is a recently popular activation function in neural networks [26][27][28]. It is defined as f(x) = max(0, x). One advantage of the function is that it is non-linear and supports backward propagation for error minimization. Additionally, the function can activate multiple neuron layers. Figure 4 shows the rectified linear unit (ReLU) activation function.
Softmax is a type of logistic function in mathematics. The softmax function accommodates the output of each unit in the range 0 to 1, displayed over a K-dimensional vector of arbitrary real numbers [29][30][31]. The function is used as an activation function because of its categorical probability distribution characteristic: it can be used for any number of classes and estimates the probability that each of the tested classes is true. The softmax function is given by σ(z)_j = e^(z_j) / Σ_{k=1}^{K} e^(z_k), for j = 1, ..., K.
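For reference, both activation functions can be written in a few lines of NumPy; the max-subtraction in the softmax is a standard numerical-stability trick that the paper does not mention.

```python
# Plain NumPy sketches of the two activation functions defined above.
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0, x)

def softmax(z):
    # Squashes a K-dimensional vector into probabilities in (0, 1) summing to 1.
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

print(relu(np.array([-2.0, 0.5])))         # -> [0.  0.5]
print(softmax(np.array([1.0, 2.0, 3.0])))  # -> probabilities summing to 1
```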

SSNNB FRAMEWORK
The Spark configuration parameters are tuned with the values predicted by the self-tuning approach SSNNB, whose architecture is shown in Figure 5. SSNNB considers two input values: dataset size and execution time. As shown in Figure 5, the architecture consists of several blocks:
- Training data are obtained from a database.
- The received data are used by the "Model Training" block to generate the model.
- The generated model is stored in a fixed location by the "Store Model on Disk" block.
- The "Predicted Parameter Value" block provides the predicted optimum parameter values.
- Finally, the predicted optimum values are received and updated in the "Spark System" block (a sketch of this step follows below).
The five selected parameters are shown in Table 1. The column "Default value" displays the default parameter values, and the column "Range value" displays the range of the selected parameters in the Spark method [32][33][34]. Self-tuning is required when processing data of various sizes and types in order to minimize processing time and achieve maximum performance from Spark [35]. This paper selected five predominant Spark parameters based on the review in [36], for three notable reasons. Firstly, the selected five parameters cover the CPU, memory, and disk resources of a cluster. Secondly, they have a great impact on the scheduling and shuffling modules. Thirdly, they also have a significant impact at both the machine and cluster levels [37].
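As a sketch of the final block above, the predicted values can be applied to a Spark session as follows; this is illustrative code, not the paper's implementation, and the values shown are those reported for the evaluation (see the conclusion).

```python
# Hedged sketch of updating the "Spark System" block with predicted values
# for the five Table 1 parameters; the session name is hypothetical.
from pyspark.sql import SparkSession

predicted = {
    "spark.driver.cores": "4",
    "spark.driver.memory": "4g",
    "spark.executor.cores": "30",
    "spark.executor.memory": "6g",
    "spark.reducer.maxSizeInFlight": "80m",
}

builder = SparkSession.builder.appName("ssnnb-tuned-job")  # hypothetical name
for key, value in predicted.items():
    builder = builder.config(key, value)

# Note: driver settings only take effect if supplied before the driver JVM
# starts, e.g. via spark-submit --conf, rather than inside a running driver.
spark = builder.getOrCreate()
```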

Data collection
Training data were collected by running Spark jobs while changing the parameter values and varying the dataset sizes and types. In total, 3,000 samples were collected for training and testing the neural network model. To achieve high model accuracy, the data were normalized.
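Since the paper does not specify the normalization method, the following sketch assumes min-max scaling with scikit-learn over synthetic placeholder samples.

```python
# Assumed normalization step: min-max scaling over placeholder data; the
# paper states only that normalization was performed, not which kind.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# 3,000 samples of (dataset size, execution time); units are assumptions.
samples = np.random.rand(3000, 2) * np.array([50.0, 600.0])

scaler = MinMaxScaler()                 # rescales each feature into [0, 1]
normalized = scaler.fit_transform(samples)
```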

Training and testing
For training the neural network model, 80% of the data were randomly selected, and the remaining 20% were used for testing. To obtain the best accuracy from the model, the training cycle was repeated several times. During training, the number of epochs was increased up to 250, at which point the model accuracy was 97.1% for training and 96.7% for testing. It was observed that accuracy increased during training and testing as the number of epochs increased. As Figure 6 shows, beyond 250 epochs there is no significant improvement in either model accuracy or model loss.
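A minimal sketch of this 80/20 split and 250-epoch training cycle, with placeholder data standing in for the collected samples, is shown below.

```python
# Sketch of the 80/20 random split and 250-epoch training described above;
# the data here are placeholders for the 3,000 collected samples.
import numpy as np
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

X = np.random.rand(3000, 2)   # normalized (dataset size, execution time)
y = np.random.rand(3000, 1)   # target values for one tuned parameter

# Randomly select 80% for training and hold out 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = Sequential([Dense(3, input_dim=2, activation="relu"), Dense(1)])
model.compile(loss="mean_squared_error", optimizer="adam")

# Beyond 250 epochs the paper reports no significant improvement (Figure 6).
history = model.fit(X_train, y_train, epochs=250, verbose=0,
                    validation_data=(X_test, y_test))
```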

Test bed
The SSNNB approach used a Dell PowerEdge R720 server as a testbed. The server is equipped with an Intel Xeon E5-2650 v2 16-core processor @ 2.60 GHz and 32 GB of PC3 memory. The operating system was Ubuntu 17.10, with Hadoop version 2.8.1 and Spark version 2.2.0. The self-tuning task can run in an independent or a separate VM. As listed in Table 2, the Spark job was run with five different datasets of 5 GB, 10 GB, 15 GB, 20 GB, and 50 GB, collected from the PUMA benchmark suite. To facilitate a fair comparison with the default system, the same five parameters were selected. Datasets ranging from 1 GB to 5 GB were used during training, and the remaining datasets up to 50 GB were used during the evaluation process.

Artificial neural network model development
For ANN model development, the required machine learning libraries are imported from Keras. Keras is a well-known library, supported behind the scenes by TensorFlow, and the Keras framework is much easier to use than TensorFlow directly. The variables X, Y, and Z are used to load and store the training and test data. X and Y hold the training data, namely the execution time and dataset size obtained by manual parameter tuning; similarly, the variable Z holds the size and execution time of the test data. Both the training and the test datasets are fed into the system. The necessary hidden layer is built from the base model, and the activation functions are added. In the base model, dropout is also applied to prevent overfitting. Figure 6 shows that beyond 250 epochs, accuracy and loss do not improve substantially. A model is saved for every parameter: five models are built by changing Y to each of the five distinct parameters, as illustrated in Figure 7 (ANN models to predict the optimized parameter; P denotes a parameter). A sketch of this per-parameter loop follows below.

Figure 8 presents the computation time of the Spark job for both the default configuration and SSNNB across various sizes of input datasets. The time required to execute the Spark job is substantially lower with SSNNB than with the default parameter settings, independent of input size in the range of 5 GB to 50 GB.
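As referenced above, here is a hedged sketch of the per-parameter training loop of Figure 7; the output directory, placeholder data, and file format are assumptions.

```python
# Hedged sketch of Figure 7: one ANN per tuned parameter, trained by swapping
# the target column Y, then stored on disk (Figure 5). Data are placeholders.
import os
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

PARAMETERS = [
    "spark.driver.cores",
    "spark.driver.memory",
    "spark.executor.cores",
    "spark.executor.memory",
    "spark.reducer.maxSizeInFlight",
]

def build_model():
    model = Sequential([Dense(3, input_dim=2, activation="relu"), Dense(1)])
    model.compile(loss="mean_squared_error", optimizer="adam")
    return model

X = np.random.rand(3000, 2)  # placeholder (dataset size, execution time)
os.makedirs("models", exist_ok=True)  # assumed output directory

for name in PARAMETERS:
    Y = np.random.rand(3000, 1)        # placeholder targets for this parameter
    model = build_model()
    model.fit(X, Y, epochs=250, verbose=0)
    model.save("models/%s.h5" % name)  # "Store Model on Disk" block, Figure 5
```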

Ability of self-tuning and execution time speedup
To assess the self-tuning ability of the SSNNB framework, a Spark job was evaluated for five distinct input data sizes of 5 GB, 10 GB, 15 GB, 20 GB, and 50 GB, with both the SSNNB and the default configuration. The predicted ideal parameter values are presented in Figure 9. Referring to Figure 8, with the default configuration for dataset sizes of 5, 10, 15, 20, and 50 GB, Spark takes 8.33, 14, ... (the full execution times are given in Tables 3 and 4). From Tables 3 and 4, it can be seen that the SSNNB approach was on average about 30% faster than the default configuration, independent of dataset size.

CONCLUSION
This research introduces a novel self-tuning approach for the predominant Spark parameters to speed up execution when handling big data, including datasets of different sizes and varieties. The approach estimates the optimum values of the five selected parameters, receives those values from the neural network model, and updates them in the Spark system before processing. A Dell PowerEdge R720 server and five different datasets were used in the evaluation. The performance of SSNNB was compared with the default configuration, and the results show a performance improvement of 30% on average. It was also observed that the performance improvement grew as the dataset size increased. Future research will focus on selecting a more appropriate number of parameters and using better servers to obtain better outcomes; metaheuristic algorithms will be considered for this optimization. (Predicted parameter values from Figure 9: "spark.driver.cores" = 4, "spark.driver.memory" = 4g, "spark.executor.cores" = 30, "spark.executor.memory" = 6g, "spark.reducer.maxSizeInFlight" = 80m.)