Performance evaluation of MapReduce JAR, Pig, Hive, and Spark with machine learning using big data

Santosh Kumar J., Raghavendra B. K., Raghavendra S., Meenakshi
Department of Computer Science and Engineering, KSSEM, Bangalore, Affiliated to VTU Belagavi, India
Department of Computer Science and Engineering, BGSIT, ACU Deemed to be University, India
Department of Computer Science and Engineering, Christ Deemed to be University, India
Department of Computer Science and Engineering, Jain Deemed to be University, India


INTRODUCTION
Big data refers to data sets whose size is beyond the ability of typical database management tools to capture, store, manage, and analyze. Cloud computing and big data, two disruptive trends at present, exert significant influence on the current IT industry and research communities. Cloud computing provides massive computation power and storage capacity, enabling users to deploy applications without infrastructure investment. Integrated with cloud computing, data sets have become so large and complex that handling their analysis pipeline is a considerable challenge for traditional data processing tools. Such data sets often come from various sources and in different varieties, such as unstructured social media content and semi-structured medical records and business transactions; they are of large volume and arrive at high velocity [1].
The MapReduce framework has been widely adopted by a large number of companies and organizations to process huge volumes of data. Unlike the traditional MapReduce framework, one incorporated with cloud computing becomes more flexible, scalable, and cost-effective. A typical example is the Amazon Elastic MapReduce (EMR) service: users can invoke Amazon EMR to run their MapReduce computations on the infrastructure offered by Amazon Web Services and are charged in proportion to their usage of the services. In this way, it is economical and convenient for companies and organizations to capture, store, organize, share, and analyze big data to gain competitive advantages. MapReduce is currently a major big data processing paradigm. The authors noted that existing performance models for MapReduce only comply with specific workloads that process a small fraction of the entire data set, thus failing to assess the capabilities of the MapReduce paradigm under heavy workloads that process exponentially increasing data volumes. They discussed building and analyzing a scalable and dynamic big data processing system, including storage, an execution engine, and a query language, and concentrated mainly on the design and implementation of a resource management system, the design and implementation of a benchmarking tool for the MapReduce processing system, and the evaluation and modeling of MapReduce using workloads with very large data sets [2]. Spark is reported to be up to 100 times faster than MapReduce with HDFS in storage and processing. Like other Java frameworks, it is built on top of the OS to utilize memory and other CPU resources efficiently, and it is a framework designed particularly for big data processing.
Spark has both advantages and disadvantages: efficient memory management is one of its weaknesses, whereas its speed in processing big data is an advantage compared with the MapReduce framework and HDFS of Hadoop.
Flink is also a framework that works with the components of the Hadoop ecosystem. Flink is a framework for streaming data; its latency in processing big data is much lower than Spark's. Flink has many advantages: it processes data with very low latency, and it mitigates the memory exception problems seen in other engines. Flink can also interact with many systems that have different storage back ends, and it optimizes the program before execution.
Big data processing technologies such as Hadoop MapReduce, Flink, and Spark, along with a caching data processing engine and a scheduler, are shown in Figure 1. Data processing techniques such as data understanding, data exploration, and data modeling are shown in Figure 2. Big data ecosystem components such as Pig, Hive, Spark, Ambari, ZooKeeper, MLlib, HBase, and many others are shown in Figure 3.

LITERATURE REVIEW
Many authors describe Apache Hadoop as a framework for processing large distributed data sets across clusters of computers and discuss scaling the cluster. Because the sensors on all the devices and network tools of organizations generate big data, organizations want to store and analyze it without investing heavily in managing storage and processing, and prefer to deploy everything on the cloud so that the cloud provider takes care of these concerns; the companies can then use the data for analysis and extract useful knowledge from it. MapReduce is the framework that allows large data to be stored across many devices and processed by them: map functions distribute and store the data across the devices, and reduce functions process the client's query. It works on the basis of key-value pairs; each line is treated as a key-value record, where the first word is the key and the rest is the value. Whenever a client wants to store large data, the client first approaches the name node, which responds with the available free nodes; the client's mapper functions then write the data to the respective data nodes. When the client wants to process the data, it sends a request to the job tracker; the job tracker communicates with the name node to find where the data is stored, then assigns jobs to task trackers, which process the tasks on the nodes that hold the data; finally, one of the nodes aggregates the results and returns them to the client [3].
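The key-value flow described above can be sketched in miniature. The following is a single-process Python sketch of the map, shuffle, and reduce phases for word counting; it is illustrative only, since a real Hadoop job distributes each phase across data nodes.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle phase: group all mapper output by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    """Reduce phase: sum the counts emitted for a single word."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster, `shuffle` corresponds to the network transfer that routes all pairs with the same key to the same reducer node.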
On Hadoop's optimization framework for MapReduce clusters, the authors state that the most widely used framework for developing MapReduce-based applications is Apache Hadoop. However, developers face a number of challenges in the Hadoop framework that complicate the management of resources in a MapReduce cluster and hence the performance of the MapReduce applications running on it. The constraints lie in the resource allocation process of the MapReduce programming model for large-scale data processing. A novel technique called the dynamic approach speeds up the use of the available resources. It contains two major operations: slot utilization optimization and utilization efficiency optimization. The dynamic technique has three slot allocation mechanisms: dynamic Hadoop slot allocation, speculative execution performance balancing, and slot pre-scheduling. It achieves a performance speedup over the recently proposed cost-based optimization approach, and the performance benefit increases with input data set size [4].
In "Performance Evaluation of Hadoop and Oracle Platform for Distributed Parallel Processing in Big Data Environments", the authors discussed reducing data center implementation cost by using commodity hardware to provide high-performance computing, and the distributed processing of large data sets across clusters of computers using a distributed and parallel computing architecture. They also compared the performance of a distributed parallel computing system with a traditional single-machine computing system. Toward an optimized big data processing system, the authors discussed a resource management system for a MapReduce-based processing system for deploying and resizing MapReduce clusters, a benchmarking tool for the MapReduce processing system, the evaluation and modeling of MapReduce using workloads with very large data sets, and optimizing the MapReduce system to efficiently process terabytes of data. In an overview of performance testing approaches for big data, the authors state that many organizations face challenges in framing test strategies for structured and unstructured data validation, setting up an optimal test environment, working with non-relational databases, and performing non-functional testing. These challenges cause poor data quality in production, delays in implementation, and increased cost. MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications, whose actual performance is measured in terms such as response time, maximum online user data capacity, and maximum processing capacity [5]. Other authors discussed big data and cloud computing management appliances and the processing problems of big data, with reference to cloud computing, cloud databases, cloud architecture, and MapReduce optimization techniques [6].
The authors discussed resource management for mapper- and reducer-based application processing, deploying and resizing MapReduce clusters, benchmarking applications and tools for MapReduce processing, extending the evaluation of MapReduce using workloads with big data, optimizing MapReduce to process terabytes of data proficiently, and cost optimizations for workflows in the cloud [7, 8]. The authors discussed software to expand the scalability of data analytics, and the challenges of availability, partitioning, virtualization, scalability, distribution, elasticity, and performance bottlenecks in managing big data [9]. The authors benchmarked several high-performance computing (HPC) architectures for data; name node and data node architectures with large memory and bandwidth are better suited for big data analytics on HPC hardware, and budget-driven scheduling algorithms were proposed for batches of MapReduce jobs in heterogeneous clouds [10, 11]. MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications, whose actual performance is measured in terms such as response time, maximum online user data capacity, and maximum processing capacity [12]. Another paper discussed parallel processing techniques [13]. Another author discussed performance issues with cloud and big data [14]. One author covered testing techniques and performance enhancement parameters [15], and other authors discussed the performance of Hadoop on multicore architectures [16]. One author discussed how machine learning techniques with Hadoop may enhance performance [17]. Others described Hadoop self-tuning of mappers and reducers with ML and cluster architectures, and the optimization of big data performance parameters [18, 19]. The authors compared performance on Oracle and Hadoop and found that Hadoop enhances performance [20].
The authors discussed MapReduce execution time for the Big.txt input file with the CloudxLab Hadoop big data framework [21]. The authors discussed MapReduce execution time for the Ramayana text input file with the CloudxLab Hadoop big data framework [22]. One author discussed how AWS cost-based optimization of MapReduce programs may enhance performance [23]. Another said that efficient utilization of mappers and reducers may enhance performance [24]. Resource-aware adaptive scheduling for MapReduce clusters was also discussed [25], as was the performance of Pig, Hive, and a Hadoop JAR file [26]. Figure 4 shows the MapReduce architectural framework for the word count program: a huge input file is split into blocks of pages, each page is split into lines, and each line is split into words by spaces; all words are then shuffled across the data nodes' mappers to count the occurrences of each word on each data node, and finally reducers combine the results produced by each data node. The character count job was run in CloudxLab as: hadoop jar /usr/hdp/2.3.4.0-3485/hadoop-mapreduce/hadoop-streaming.jar -input /data/mr/wordcount/input -output letter_count -mapper mapper.py -file mapper.py -reducer reducer.py -file reducer.py. Table 1 shows the output of the character count job, which reads the input file, calculates the number of occurrences of each character, and stores the result in the output file. Figures 4-7 show the execution time of the word count program as a Pig script and as a Hive query. First we create a table called doc and load an input file, after which the word count query executes in 14 seconds, for a total of 20 seconds (14 s + 6 s) to run the word count program on the input file with Hive. With Pig, the total is 36 s + 16 s = 52 s for the same input file. Table 1 shows the characters and their counts on MapReduce Hadoop after execution.
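The mapper.py and reducer.py scripts referenced in the streaming command above are not listed in the paper. A plausible minimal version of their logic for the character count job, following the Hadoop Streaming convention of tab-separated key/count records passed through stdin and stdout with a sort between the two stages, might look like this (the function bodies are assumptions, not the authors' actual scripts):

```python
from itertools import groupby

def char_mapper(lines):
    """mapper.py logic: emit one '<char>\t1' record per non-space character."""
    for line in lines:
        for ch in line.rstrip("\n"):
            if not ch.isspace():
                yield f"{ch}\t1"

def char_reducer(sorted_records):
    """reducer.py logic: sum counts per character.
    Hadoop Streaming delivers mapper output to the reducer sorted by key,
    so consecutive records with the same key can be grouped directly."""
    split = (rec.split("\t") for rec in sorted_records)
    for ch, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{ch}\t{sum(int(n) for _, n in group)}"
```

Run locally, the job is the pipeline `char_reducer(sorted(char_mapper(open("input.txt"))))`, mirroring Hadoop's map, sort/shuffle, and reduce stages.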

RESULTS AND DISCUSSION
In the MapReduce framework for word count, as shown in Figure 4, the huge input data is divided and given to the mappers as key-value pairs; after the shuffle stage, the reducer combines the results of the mappers. The word count program executed with Spark, Hive, and a machine learning query in 6 seconds, as shown in Figure 5. With a Spark Hive query it executed in 14 seconds, as shown in Figure 6, and in 16 seconds, as shown in Figure 7. With a Pig query it executed in 36 seconds, as shown in Figure 8.
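Spark expresses this same word count as a chain of RDD transformations (flatMap, map, reduceByKey) rather than explicit mapper and reducer scripts. Since no cluster is assumed here, the sketch below mimics the shape of that pipeline in plain Python, purely for illustration:

```python
from collections import Counter

def spark_style_word_count(lines):
    """Mimic Spark's flatMap -> map -> reduceByKey word count in one process.
    In real Spark these stages run in parallel across executors."""
    words = (word for line in lines for word in line.split())  # flatMap
    pairs = ((word, 1) for word in words)                      # map
    counts = Counter()                                         # reduceByKey
    for word, n in pairs:
        counts[word] += n
    return dict(counts)
```

The corresponding PySpark chain would be `lines.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`; keeping the whole pipeline in memory is a large part of why Spark outperforms disk-bound MapReduce in the timings above.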

CONCLUSION
Hadoop is a software framework for processing data of high variety, volume, and velocity. Companies like Google, Yahoo, and Amazon have their own frameworks for processing big data, and they also provide cloud-based big data ecosystem infrastructure to store (using HDFS) and process (using MapReduce) big data. From the above results, the Hive query execution time is 20 seconds, whereas the Pig script execution time is 52 seconds for the same input file without machine learning; with machine learning, the time improves to 16 seconds using the combination of ML and Spark with Hive. We can therefore say that, for the word count program on the given input file, Hive is better than Pig and improves the execution time, and we may state that machine learning and Spark with Hive give better performance than Hadoop MapReduce, Pig, Spark, and Flink.