Straggler handling approaches in the MapReduce framework: a comparative study

ABSTRACT

Stragglers, tasks that run considerably slower than their peers, are a well-recognized bottleneck in MapReduce-based big data processing, since a job cannot finish until all of its map and reduce tasks complete. This work compares five straggler identification and handling methods: the Hadoop native scheduler, the longest approximate time to end (LATE) scheduler, Mantri, MonTool, and Dolly. Their performance is assessed using three benchmark workloads: Sort, Grep, and WordCount. Experimental results show that the LATE scheduler outperforms the other four methods on all three workloads.
INTRODUCTION
With the excessive growth in information and data, analysis has become more challenging and complex due to the increasing volume of structured and unstructured data produced by the internet of things (IoT), social media, multimedia, and other sources. MapReduce is a fault-tolerant, scalable, and simple framework for data processing that enables its users to process these massive amounts of data effectively [1,2]. MapReduce is a significant model for processing and generating enormous data sets: it provides a simple programming environment and supports ad hoc analyses such as data sorting and web indexing, among several others. MapReduce is used for big data applications at large companies such as Yahoo and Google.
Stragglers can arise in any part of a MapReduce job. Typical causes are variability in CPU availability, I/O contention, and network traffic. A MapReduce job is not complete until every map and reduce task has finished [3,4]; consequently, even a small number of stragglers can considerably lengthen the overall job completion time [5][6][7][8].
In a heterogeneous environment, some compute nodes are faster than others. The slower compute nodes are called straggler nodes; the fast nodes finish their tasks early and then wait for the stragglers to finish. Nodes may also fail due to hardware or software faults. It is therefore important to detect stragglers at an early stage to avoid performance degradation in the system. Nowadays, organizations with large amounts of data struggle to process and analyze them using traditional database management systems. By designing MapReduce, Google made it possible for millions of users around the world to find content from millions of pages within a hundredth of a second; bulk processing has thus become a major challenge, and the associated analysis technologies are changing rapidly. At the same time, stragglers are well recognized as a major bottleneck in big data processing and can have a significant impact on it. This work aims to evaluate five straggler identification methods: the Hadoop native scheduler, the longest approximate time to end (LATE) scheduler, Mantri, MonTool, and Dolly. The performance of these techniques was assessed using three benchmark workloads: Sort, Grep, and WordCount.
The remainder of this paper is structured as follows: the second section describes MapReduce and stragglers. In the third section, five straggler identification approaches are presented. In the fourth section, experimental results are presented. This section is followed by the conclusion and future work of this study.

MAPREDUCE FRAMEWORK AND STRAGGLERS
MapReduce is a parallel data processing model proposed for large-scale data processing on cluster-based computing architectures [9]. The framework is used internally at data centers to support data mining, search applications, and machine learning. It was originally proposed by Google to handle large-scale web search applications. Its focus is to abstract programmers from problems such as parallelization, scheduling, and data allocation, thus allowing them to concentrate on developing applications. In modern organizations and enterprises, four requirements matter for large data: processing, storing, visualizing, and analyzing it. MapReduce can run applications on a parallel cluster of hardware automatically; in addition, it can process terabytes and petabytes of data rapidly.
Recently, it has gained popularity in a wide range of applications due to its ability to provide a highly effective and efficient framework for parallel execution of applications, data allocation in distributed database systems, and fault-tolerant network communication. As illustrated in Figure 1, the input data is divided into fixed-size blocks, and parallel map tasks process each block as a collection of <key, value> pairs, producing intermediate output. The MapReduce programming model comprises two data processing functions: map and reduce.
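To make the two functions concrete, the following minimal sketch simulates the model in plain Python for the WordCount workload used later in this paper (the function and variable names are illustrative, not Hadoop's API; a real deployment would distribute the map and reduce calls across a cluster):

    from collections import defaultdict

    def map_fn(key, value):
        # key: document name (unused), value: one line of text.
        # Emit a <word, 1> pair for every word in the line.
        for word in value.split():
            yield (word, 1)

    def reduce_fn(key, values):
        # key: a word, values: all counts emitted for that word.
        yield (key, sum(values))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Shuffle stage: group intermediate pairs by key, as the
        # framework would do between the map and reduce phases.
        groups = defaultdict(list)
        for key, value in inputs:
            for k, v in map_fn(key, value):
                groups[k].append(v)
        output = []
        for k in sorted(groups):  # sort by key, as in the sort/merge stage
            output.extend(reduce_fn(k, groups[k]))
        return output

    lines = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_mapreduce(lines, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]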

In the map phase, when the user submits a job, the tasks are sent to the map machines to run. A combiner can reduce the amount of data transmitted across the network to the reduce phase. The sort/merge step is part of the reduce phase: the time used to integrate map outputs from different nodes is counted as reduce time. The reduce step is the last stage of running a job in the MapReduce way. Each part of this process affects the runtime differently, so to estimate the end time of each job, appropriate weights should be used for each part, where the weight of a part is the ratio of its time to the total time. For more details see [9].
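As an illustration of this weighting scheme, the following sketch estimates a task's total runtime from its per-phase progress (the weights and phase names are example values chosen for illustration; Hadoop's native scheduler simply assumes each reduce phase contributes 1/3):

    # Example phase weights: each weight is the ratio of that phase's
    # typical time to the total task time (illustrative values).
    PHASE_WEIGHTS = {"copy": 0.4, "sort": 0.2, "reduce": 0.4}

    def weighted_progress(phase_progress):
        # phase_progress maps phase name -> fraction completed in [0, 1].
        return sum(PHASE_WEIGHTS[p] * phase_progress[p] for p in PHASE_WEIGHTS)

    def estimated_total_time(elapsed_seconds, phase_progress):
        # Assume the task keeps progressing at its average rate so far.
        progress = weighted_progress(phase_progress)
        if progress == 0:
            return float("inf")
        return elapsed_seconds / progress

    # A task 50% through its sort phase after 120 s:
    prog = {"copy": 1.0, "sort": 0.5, "reduce": 0.0}
    print(estimated_total_time(120, prog))  # 120 / 0.5 = 240 s total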
Stragglers are tasks that take considerably longer to execute than comparable tasks. There are various reasons why a task may take longer, such as faulty machines, the amount of data to process, network congestion, heterogeneity among hardware, and contention for shared resources [10,11]. However, a task is not necessarily slow throughout its entire execution, and if one task runs slowly on a given machine, it does not follow that all present and future tasks will run slowly on that particular machine.
There are many reasons for such stragglers to occur, including load imbalance, scheduling inefficiencies, data locality, communication overheads, and hardware heterogeneity [12,13]. There have also been efforts to address one or more of these concerns and mitigate the problem [14][15][16]. Although all of these prior efforts are important and useful, we believe that a rigorous set of analytical tools is needed to better understand the consequences of stragglers on performance slowdown in big data processing [17,18].

STRAGGLER IDENTIFICATION TECHNIQUES
Various methods have been proposed for straggler identification. In this section, five of them are presented and assessed for their suitability: the Hadoop native scheduler, the LATE scheduler, Mantri, MonTool, and Dolly.

Hadoop native scheduler
The Hadoop native scheduler assigns each task a progress score in the range 0 to 1. For a map task, the progress score is the fraction of input data read [19]. The execution of a reduce task is divided into three stages, each accounting for 1/3 of the score:
- In the first stage (the copy stage), the task fetches the map outputs.
- In the second stage (the sort stage), the map outputs are sorted by key.
- In the reduce stage, the user-defined functions are applied to the list of map outputs.
Within each stage, the score is proportional to the fraction of data processed. Hadoop computes the average progress score for each category of tasks to define a threshold for speculative execution: a task that has run for at least one minute and whose progress score is more than 0.2 below the average for its category is marked as a straggler. Tasks below the threshold are treated alike, and ties between them are broken by data locality. The scheduler ensures that only a single speculative copy of each task is running at any time, and it reschedules tasks taking their past progress into account. Several of the scheduler's crucial assumptions break in virtualized, heterogeneous clusters:
- In a reduce task, the copy, sort, and reduce stages each take about 1/3 of the total time.
- Nodes are homogeneous and execute tasks at roughly the same rate, and tasks progress at a constant rate over time.
- Tasks in the same category perform a similar amount of work. This assumption fails in particular for reduce tasks when there is a large variation in the keys assigned to a specific reducer.

LATE scheduler
The longest approximate time to end (LATE) scheduler estimates each running task's remaining time from its progress rate and speculatively re-executes the task expected to finish farthest in the future. It is robust to node heterogeneity because re-launching penalizes only the flagged tasks, and only a small number of tasks are re-launched. However, the LATE scheduler has the following weaknesses [21]:
- The time required by LATE before a task can be marked as a straggler is relatively long, usually one minute, during the initial evaluation.
- Because the remaining time of a task is estimated from the average past progress rate combined with the current progress rate, the predicted end time can be inaccurate, especially in the final part of a task.
- Large tasks inherently take longer than the rest to process and are therefore likely to be labelled as speculation candidates, wasting resources.
- Without an account of the underlying cause of the detected stragglers, the straggler determination may be wrong, which substantially increases response time.
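The sketch below contrasts the two selection rules described above, assuming per-task progress statistics are collected elsewhere (illustrative Python, not Hadoop code): the native rule flags tasks more than 0.2 below the average progress score, while LATE ranks tasks by their estimated time left.

    def native_stragglers(tasks, threshold=0.2):
        # tasks: list of dicts with "progress" in [0, 1].
        # Flag tasks more than `threshold` below the average progress score.
        avg = sum(t["progress"] for t in tasks) / len(tasks)
        return [t for t in tasks if t["progress"] < avg - threshold]

    def late_candidate(tasks):
        # LATE: progress_rate = progress / elapsed;
        # time_left = (1 - progress) / progress_rate.
        # Speculate on the task whose estimated finish is farthest away.
        def time_left(t):
            rate = t["progress"] / t["elapsed"]
            return (1.0 - t["progress"]) / rate
        return max(tasks, key=time_left)

    tasks = [
        {"id": "t1", "progress": 0.9, "elapsed": 90},
        {"id": "t2", "progress": 0.3, "elapsed": 90},   # slow node
        {"id": "t3", "progress": 0.8, "elapsed": 90},
    ]
    print([t["id"] for t in native_stragglers(tasks)])  # ['t2']
    print(late_candidate(tasks)["id"])                  # 't2'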

Mantri
Mantri is superior at handling outliers because it uses continuous progress reports, following up on and distinguishing stragglers from the start. Acting early frees the resources used by upcoming tasks and speeds up the job overall [21]. Mantri improves over previous work, which simply duplicates slow tasks, by acting based on the causes of the slowdown and on the resource and opportunity cost of its actions. It uses the following techniques:
- restarting outlier tasks while respecting resource constraints and work imbalance,
- protecting the output of tasks based on a cost-benefit analysis, and
- network-aware placement of tasks.
Mantri recognizes points at which tasks are unable to progress at the normal rate and executes targeted actions, and it places tasks based on the locations of their data sources. Unlike prior outlier mitigation schemes, Mantri's plans are cause-aware and resource-aware. Its key mechanisms are:
- A speculative copy is scheduled only if doing so reduces the amount of resources consumed; there can be at most three copies of the same task.
- If the expected remaining time of a task is much higher than the expected running time of a restarted copy, the task is restarted, up to the maximum number of copies.
The expected remaining time (t_rem) and the expected runtime of a new copy (t_new) are estimated as:
t_rem = (t_elapsed * d / d_read) + t_wrapup
t_new = processRate * locationFactor * d + schedLag
where d is the total data to process, d_read is the data already processed, and t_wrapup is the time to write out the results.
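A minimal sketch of these estimates follows (the names mirror the formulas above; the factor of 2 in the speculation test is an illustrative threshold, not Mantri's exact cost-benefit rule):

    def t_rem(t_elapsed, d, d_read, t_wrapup):
        # Expected remaining time of a running task: elapsed time scaled
        # by the ratio of total data to data already read, plus the time
        # needed to write out the results (as in the formulas above).
        return t_elapsed * d / d_read + t_wrapup

    def t_new(process_rate, location_factor, d, sched_lag):
        # Expected runtime of a fresh copy: measured per-byte process
        # rate, adjusted for where the copy would run relative to its
        # input, plus the scheduling delay.
        return process_rate * location_factor * d + sched_lag

    def should_speculate(rem, new, copies_running, max_copies=3):
        # Schedule a duplicate only while fewer than three copies exist
        # and the running copy is expected to take much longer than a
        # fresh one (the factor 2 here is illustrative).
        return copies_running < max_copies and rem > 2 * new

    # A task that has read 2 GB of 10 GB in 400 s, versus a fresh copy:
    rem = t_rem(400, 10e9, 2e9, 20)                       # 2020 s
    new = t_new(5e-8, 1.0, 10e9, 5)                       # 505 s
    print(should_speculate(rem, new, copies_running=1))   # True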

MonTool
MonTool gathers information about tasks by tracing their system calls and analyzing them; using these data, it discovers stragglers, their effects, and their causes. A MonTool daemon running at the master collects system-call traces from all worker nodes, builds state machines for the various tasks, and compares the running state machines through a similarity score. MonTool's sampling approach sleeps for 2 seconds and, on waking, processes the system calls gathered during those 2 seconds [22]; the master then receives the collected information. MonTool generates system-call state machines every 10 milliseconds by gathering the disk/network read and write system calls issued by the map and reduce processes created by Hadoop. The similarity score for two processes is computed from N_i^1 and N_i^2, the numbers of state transitions for the i-th pair of states in the first and second process, respectively. The straggler threshold is set to 85 percent: if one process makes fewer than 85 percent of the expected system calls, it is marked as a straggler and a speculative copy is launched. Limitations:
- Correlating system calls cannot be accomplished without information about the keys, and the keys are usually unavailable in Hadoop.
- It assumes that all map or reduce tasks work on similarly sized inputs and access data in a similar pattern. This assumption fails for reduce tasks, since the data size read by each reduce task may differ.
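The sketch below illustrates the 85-percent rule (since the exact similarity formula of [22] is not reproduced here, the sketch simply compares each process's system-call count in a sampling window against the median of its peers, which is an assumed simplification):

    def montool_stragglers(call_counts, threshold=0.85):
        # call_counts: process id -> number of system calls observed in
        # the current 2-second sampling window.
        # Mark a process as a straggler if it made fewer than 85% of the
        # calls made by its typical (median) peer.
        counts = sorted(call_counts.values())
        median = counts[len(counts) // 2]
        return [pid for pid, c in call_counts.items()
                if c < threshold * median]

    window = {"map-1": 980, "map-2": 1010, "map-3": 540, "map-4": 1000}
    print(montool_stragglers(window))  # ['map-3']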
Dolly
Dolly manages stragglers in a proactive manner through cloning. Rather than waiting and attempting to predict stragglers, it takes speculative execution to its extreme: it launches multiple clones of every task of a job and uses only the result of the clone that finishes first [25]. Cloning small jobs can be accomplished with few extra resources because of the heavy-tailed distribution of job sizes [23,24]; most jobs are small and can be cloned with little overhead. The primary challenge of cloning is making intermediate data transfer efficient, i.e., preventing multiple downstream tasks of the job from contending for the same upstream output. Dolly therefore defines two strategies for mitigating contention when intermediate data is accessed by multiple map processes finishing at the same time:
- Contention-avoidance cloning (CAC): when an upstream task clone finishes, its output is directed to exactly one task in the downstream clone group. Dolly expresses the probability of a job straggling under CAC in terms of Ψ(n, c, d), the probability that the n upstream tasks of c clones each leave at least d clones per group, and p, the probability of a single task straggling.
- Contention cloning (CC): after the upstream clone task completes, all downstream task clones read that clone's output, which avoids waiting but can reintroduce contention. Dolly likewise defines the probability of a task straggling under CC in terms of n, c, and p.
Every downstream clone waits for a small period (ω) to see whether it can obtain an exclusive copy of the intermediate data; if a downstream clone does not obtain its exclusive copy within this period, it falls back to reading a contended copy.
Table 1 compares the straggler identification methods used in this work. The metric that triggers speculative execution differs across the five straggler detection techniques, whereas none of them imposes a cap on the number of speculative executions. The data processing assumption is the same for the Hadoop native scheduler, LATE, and MonTool. All techniques except the Hadoop native scheduler account for heterogeneity among network nodes, and the same holds for priority-wise scheduling.
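As a simplified illustration of why the cloning described above helps (using the notation above, but ignoring intermediate-data contention, so it models neither CAC nor CC exactly): if each clone straggles independently with probability p, a task straggles only when all c of its clones straggle.

    def p_task_straggles(p, c):
        # A task straggles only if all c of its clones straggle.
        return p ** c

    def p_job_straggles(p, n, c):
        # A job of n tasks straggles if any one of its tasks straggles.
        return 1.0 - (1.0 - p_task_straggles(p, c)) ** n

    # With a 5% per-clone straggler probability and 20 tasks:
    for c in (1, 2, 3):
        print(c, round(p_job_straggles(0.05, 20, c), 4))
    # 1 0.6415  -> without cloning, most jobs see a straggler
    # 2 0.0488
    # 3 0.0025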

EXPERIMENTAL RESULTS AND DISCUSSION
To assess the process of diagnosing straggler tasks and re-assigning them to other nodes (i.e., speculative execution) for the Hadoop native scheduler, the LATE scheduler, Mantri, MonTool, and Dolly, three benchmark workloads were used: Sort, Grep, and WordCount. We manually slowed down 8 virtual machines (VMs) in a cluster of 90 with background processes to simulate straggler (faulty) nodes. The other machines were allocated between 1 and 10 VMs per host. As our workload, we evaluated a sort task over a total dataset of 40 GB, in blocks of 128 MB per host. Each job has 575 map tasks and 510 reduce tasks (note that Hadoop leaves some free capacity for speculative and failed tasks). The stragglers were generated by running four CPU-intensive processes (tight loops manipulating 900 KB arrays) and four disk-intensive processes (FDS tasks generating huge files in a loop). The average results of 5 experiments for the evaluated straggler handling methods are shown in Figure 2 (see the Appendix). Based on the obtained results, the LATE scheduler outperformed the Hadoop native scheduler, Mantri, MonTool, and Dolly.
As can be seen in Figure 2, the results attained by the LATE scheduler on Grep are smaller than those achieved on WordCount and Sort. This can be explained by considering the workload: WordCount and Sort write large quantities of data across the network and to disk, whereas Grep sends each reducer only a limited number of bytes, i.e., a count for each word. It can therefore be concluded that the LATE scheduler performed best compared to the other methods, by 60%, 50%, and 80% when Sort, Grep, and WordCount are used, respectively.

CONCLUSION AND FUTURE WORK
With the excessive growth of information and data produced by internet of things (IoT), social media, multimedia, and other applications, analysis has become more challenging and complex due to the increasing volume of structured and unstructured data. MapReduce is a fault-tolerant, scalable, and simple framework for data processing that enables its users to process these massive amounts of data effectively. This work aimed to evaluate five straggler identification methods: the Hadoop native scheduler, the LATE scheduler, Mantri, MonTool, and Dolly, using the Sort, Grep, and WordCount benchmark workloads. According to the experimental results, the LATE scheduler outperformed the other methods used in this work; we conclude that the LATE scheduler is an efficient choice for straggler identification. In future work, machine learning methods could reveal more of the story behind the data: for instance, a deep learning approach could be used to identify suitable nodes for running straggler tasks, and it could provide more information about the number of failures in different phases and the correlation between different features, to obtain more accurate results.