Proactive cloud service assurance framework for fault remediation in cloud environment

ABSTRACT


INTRODUCTION
Resiliency in a cloud computing environment is a challenge because the environment is shared and dynamic. It involves a complete process of monitoring, predicting, detecting, troubleshooting and finally remediating faults [1]. Much work has been carried out on monitoring, predicting, detecting and troubleshooting faults in the cloud, but the fault remediation task remains relatively unexplored [2]. Endowing the cloud with a system that can remediate faults automatically and proactively is an interesting problem. Faults are remediated using techniques such as virtual machine (VM) migration, task migration, software rejuvenation, checkpointing and rollback, and others [3]. These techniques can be classified as preventive, reactive and hybrid approaches. A preventive technique is triggered before a fault occurs and is designed to restore the system to a normal state proactively; a reactive technique is triggered after a fault is detected to restore the system to a normal state; a hybrid approach is a combination of the reactive and preventive approaches [3]. A fault can also have multiple candidate remediation techniques, each with associated costs and benefits; hence the selection of an appropriate fault tolerance technique is a decision-making problem.
This manuscript proposes to use an expert system for decision making, i.e., for selecting the best remediation technique based on the nature of the fault that occurred. Expert systems in artificial intelligence are knowledge-based systems that mimic the decision-making capability of human experts [4]. An expert system has a knowledge base (or rule-base), a fact-base and an inference engine. The rule-base contains the domain knowledge of experts for problem solving in the form of IF-THEN rules. The fact-base contains a set of facts that are matched against the IF part of the rules. The inference engine deduces results by matching the rule-base against the fact-base [5].
A fault remediation process is a knowledge-based problem that builds upon the knowledge from previous fault handling instances. The rule-base for a fault remediation system captures the faulty states of the system and the corresponding techniques; it can be constructed from various knowledge sources such as discussion forums, logs, manuals, Bugzilla and others [6,7]. To the best of our knowledge, this is the first work using an expert system for cloud fault remediation. The main contributions of the work are: 1. a structured, cost-sensitive framework to rank remediation techniques based on multiple parameters; 2. a mathematical analysis to quantify a utility metric for each remediation technique; 3. a prototype expert system that identifies the fault type and applies the best possible remediation technique.
Formal problem definition
Consider a cloud with identical virtual machines (VMs) running on a physical machine (PM). Each VM has applications or services running on it, and the VMs and hosts are subject to performance degradation or faults. It is assumed that there exists a fault prediction system whose output acts as input to the expert system. Let there be a set of faults and a set of remedial techniques that can be used to handle them. The fault remediation system is then represented as a 5-tuple as in (1), where λ is the fault rate and α is the availability of the system, given by (2) and (3) respectively. Each fault in the fault set maps to a subset of the remediation techniques, and the goal of the fault remediation system is to select from that subset the technique that maximizes the availability α of the system.

$$\lambda = \frac{1}{\mathrm{MTBF}} \tag{2}$$

$$\alpha = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}} \tag{3}$$

where MTBF is the Mean Time Between Failures and MTTR is the Mean Time to Repair.

Fault category and remediation techniques
Broadly, faults in cloud computing can be categorized as: VM resource insufficiency, hardware fault, PM resource insufficiency, application fault, software fault, system fault [8]. VM/PM resource insufficiency refers to a VM or PM experiencing a resource crunch with respect to memory, storage or CPU. Application faults include synchronization deadlocks, buffer overflows, application performance degradation and many others. Software faults here are operating system or system faults, hardware faults are malfunctions of hardware components, and network faults are failures to communicate with other PMs or VMs. Each fault can be rectified in multiple ways, e.g., a VM resource insufficiency fault can be addressed using remediation techniques such as dynamic resource scaling of the VM, migration of a few processes to another VM, migration of the VM to a PM with sufficient resources, or scaling out to more VMs for load balancing. The mapping of fault categories to remediation techniques is summarized in Table 1.
When the prediction system fails to identify a fault (a false negative) and the fault occurs, a reactive technique is applied. Each remediation technique has a different impact in rectifying a fault. Assessing the impact of each remediation technique aids in selecting the most suitable one.
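To make the availability objective of the formal problem definition concrete, the following is a minimal Python sketch that computes the fault rate and the availability from MTBF and MTTR. It assumes the standard reliability definitions λ = 1/MTBF and α = MTBF/(MTBF + MTTR); the function names and the sample numbers are illustrative assumptions, not measurements from this work.

```python
def fault_rate(mtbf_hours: float) -> float:
    """Fault rate (lambda) as the reciprocal of the mean time between failures."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 1.0 / mtbf_hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability (alpha) = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    mtbf, mttr = 1000.0, 2.0
    print("fault rate:", fault_rate(mtbf))
    print("availability:", availability(mtbf, mttr))
```

Under these definitions, a remediation technique that shortens repair time or lengthens the time between failures raises α, which is the quantity the selection step tries to maximize.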

PROPOSED FRAMEWORK AND IMPLEMENTATION
The proposed fault remediation system can be a component of a complete cloud service assurance system, as shown in Figure 1. A cloud service assurance system is a performance management system that monitors and manages performance and faults. The components of the proposed cloud service assurance framework are as follows:
- Monitoring: The cloud monitoring component measures key performance indicators (KPIs) such as the CPU utilization, memory utilization and disk utilization of physical machines, VMs and the applications running on VMs. Various tools and frameworks are available to monitor KPIs in the cloud [9]; the measured KPIs are stored in a database for analysis, forecasting and prediction using machine learning models.
- Cloud resource forecasting: The KPIs obtained from the monitoring module can be used as historical data to train machine learning models that forecast future values of the KPIs [10]. The forecasted values can be used to predict a faulty state of the system and initiate proactive actions.
- Fuzzifier: The resource usage values or KPIs from the forecasting module are given as input to the fuzzifier module. The fuzzifier transforms crisp KPI values into linguistic terms such as high, low and medium [11]. This mapping of continuous values to categorical values simplifies the decision-making rules of the expert system. The KPI values are mapped to different membership functions to check which membership function fits best. It was observed that a trapezoidal membership function fits the data best.
- Fault remediation (an expert system): The fuzzifier module provides discrete values to the remediation module, which is an expert system responsible for identifying the category of a fault and selecting the best remediation technique. Other system parameters, such as hardware errors, application errors and system errors from the system logs and prediction modules, are also provided as input to the expert system.
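The fuzzifier step described above can be sketched as follows. This is a minimal illustration of a trapezoidal membership function in Python; the breakpoints in `TERMS` (utilization in percent) and the helper names are hypothetical and are not the values fitted in this work.

```python
def trapezoid(x: float, a: float, b: float, c: float, d: float) -> float:
    """Trapezoidal membership: 0 outside [a, d], 1 on [b, c], linear on the slopes."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    if x < b:
        return (x - a) / (b - a)  # rising edge between a and b
    return (d - x) / (d - c)      # falling edge between c and d


# Hypothetical breakpoints (a, b, c, d) for a utilization KPI in percent.
TERMS = {
    "low": (0.0, 0.0, 30.0, 50.0),
    "medium": (30.0, 50.0, 60.0, 80.0),
    "high": (60.0, 80.0, 100.0, 100.0),
}


def fuzzify(value: float) -> str:
    """Return the linguistic term with the largest membership for a crisp KPI value."""
    memberships = {term: trapezoid(value, *params) for term, params in TERMS.items()}
    return max(memberships, key=memberships.get)


if __name__ == "__main__":
    for kpi in (20.0, 45.0, 75.0, 95.0):
        print(kpi, "->", fuzzify(kpi))
```

Mapping each forecasted KPI to a single term keeps the antecedents of the expert system rules short, since a rule can test for "high" instead of a numeric range.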
The rules of the fault remediation expert system are traditional IF-THEN rules, each with an antecedent and a consequent part. The antecedent consists of one or more input parameters with comparative values, and the consequent is the category of the fault. Once the category of the fault is identified, a utility value is calculated for each candidate remediation technique using (4). The knowledge base (rule-base) of the proposed expert system consists of rules of this form.
The expert system is implemented using PyKE release 1.1. PyKE is a Python knowledge engine that can be used to build expert systems for decision making where each problem has multiple solutions [5]. PyKE supports multiple fact-bases and rule-bases and an inference engine. The fact-base contains a set of universally true statements. The rule-base consists of backward-chaining rules. The inference engine applies rules to facts to generate additional facts and prove the goals through backward chaining. The expert system uses a heap data structure to implement a priority queue of remediation techniques ordered by the utility values calculated using (4).
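The two ideas in this paragraph, IF-THEN classification and a heap-based priority queue of techniques, can be sketched in plain Python. This is not the PyKE rule syntax and it uses simple forward matching rather than backward chaining; the rule conditions, fact names and utility numbers are hypothetical placeholders, and only the standard-library `heapq` module is used.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Facts = Dict[str, str]

# Hypothetical IF-THEN rules: (antecedent over fuzzified facts, fault category).
RULES: List[Tuple[Callable[[Facts], bool], str]] = [
    (lambda f: f.get("vm_cpu") == "high" or f.get("vm_memory") == "high",
     "vm_resource_insufficiency"),
    (lambda f: f.get("hardware_error") == "yes", "hardware_fault"),
    (lambda f: f.get("application_error") == "yes", "application_fault"),
]


def classify(facts: Facts) -> List[str]:
    """Return the fault categories whose antecedent matches the facts."""
    return [category for condition, category in RULES if condition(facts)]


def rank_techniques(utilities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order techniques by descending utility using a heap-based priority queue."""
    # heapq is a min-heap, so utilities are negated to pop the largest first.
    heap = [(-utility, name) for name, utility in utilities.items()]
    heapq.heapify(heap)
    ranked = []
    while heap:
        neg_utility, name = heapq.heappop(heap)
        ranked.append((name, -neg_utility))
    return ranked


if __name__ == "__main__":
    facts = {"vm_cpu": "high", "vm_memory": "medium"}
    print(classify(facts))
    # Hypothetical utility values; in the framework these would come from (4).
    print(rank_techniques({"vm_migration": 0.42,
                           "process_migration": 0.61,
                           "resource_scaling": 0.55}))
```

The first element of the ranked list is the technique the remediation step would trigger; the remaining entries stay available as fallbacks in utility order.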

Analytical model for utility
This section discusses the analytical approach used to select the best remediation technique for a fault. The proposed approach defines a utility for each technique such that the higher the utility value, the better the technique. The utility value is based on four parameters: i) the impact of the remediation technique, ii) the overhead of the remediation technique, iii) the severity of the fault and iv) the priority of the application, as given in (4), where:
- Impact: the posterior probability of the system being in a stable state after applying the remediation technique.
- Overhead: the overhead of the remediation technique in terms of availability.
- Severity: the degree to which the fault can negatively impact the system. It is determined by the system administrator or expert. The higher the severity value, the more critical the fault. In this study the severity values for faults are defined as low (0.3), medium (0.5) and critical (0.7). Hardware faults and VM/PM resource insufficiency are assigned the critical value (0.7), application faults and software faults are assigned the medium value (0.5), and network faults are assigned the low value (0.3).
- Priority: the priority of an application is defined based on the nice value in Linux-based systems. It ranges from 0 to 19; the lower the value, the higher the priority of the application.
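The severity and priority inputs described above can be captured as a small lookup table with validation. The following Python sketch uses the severity values stated in the text; the dictionary keys and function names are hypothetical, and the sketch deliberately does not compute the utility itself.

```python
# Severity values as stated in the text: critical 0.7, medium 0.5, low 0.3.
SEVERITY = {
    "hardware_fault": 0.7,
    "vm_resource_insufficiency": 0.7,
    "pm_resource_insufficiency": 0.7,
    "application_fault": 0.5,
    "software_fault": 0.5,
    "network_fault": 0.3,
}


def severity_of(fault_category: str) -> float:
    """Look up the administrator-assigned severity of a fault category."""
    try:
        return SEVERITY[fault_category]
    except KeyError:
        raise ValueError(f"unknown fault category: {fault_category}") from None


def validate_priority(nice_value: int) -> int:
    """Check that an application priority is a nice value in the 0..19 range used here."""
    if not 0 <= nice_value <= 19:
        raise ValueError("nice value must be between 0 and 19")
    return nice_value


if __name__ == "__main__":
    print(severity_of("application_fault"), validate_priority(5))
```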
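The impact term is the posterior probability of achieving a stable system state after remediation technique $r_i$ is applied, and it is computed using Bayes' rule as follows [12].

$$P(\text{Normal\_state} \mid r_i, \text{fault}) = \frac{P(r_i \mid \text{Normal\_state})\, P(\text{fault} \mid \text{Normal\_state})\, P(\text{Normal\_state})}{P(r_i)\, P(\text{fault})} \tag{5}$$

where $r_i$ is the $i$-th remedial technique and Normal_state is the stable state of the system in which the fault is rectified. $P(r_i \mid \text{Normal\_state})$ is the likelihood, i.e., the probability of remedial technique $r_i$ leading to the normal state. $P(\text{fault} \mid \text{Normal\_state})$ is the likelihood, i.e., the probability of the system going to the normal state after a fault of type fault occurs. $P(\text{Normal\_state})$ is the prior probability of a normal system, $P(r_i)$ is the prior probability of the remedial technique, and $P(\text{fault})$ is the prior probability of the fault type. The likelihoods and prior probabilities are initialized from past observations of the system.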
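As a minimal sketch of this posterior computation, the Python function below combines the five quantities described above in a naive-Bayes form: the two likelihoods and the prior of the normal state in the numerator, and the priors of the technique and the fault type in the denominator. That exact combination is an assumption made for illustration, and the parameter names and sample probabilities are hypothetical.

```python
def posterior_normal_state(
    likelihood_technique: float,  # P(r_i | Normal_state)
    likelihood_fault: float,      # P(fault | Normal_state)
    prior_normal: float,          # P(Normal_state)
    prior_technique: float,       # P(r_i)
    prior_fault: float,           # P(fault)
) -> float:
    """Posterior of reaching a normal state, assuming a naive-Bayes factorization."""
    inputs = (likelihood_technique, likelihood_fault, prior_normal,
              prior_technique, prior_fault)
    if any(not 0.0 <= p <= 1.0 for p in inputs):
        raise ValueError("all inputs must be probabilities in [0, 1]")
    if prior_technique == 0.0 or prior_fault == 0.0:
        raise ValueError("prior of technique and prior of fault must be non-zero")
    numerator = likelihood_technique * likelihood_fault * prior_normal
    return numerator / (prior_technique * prior_fault)


if __name__ == "__main__":
    # Hypothetical probabilities, as if estimated from past observations.
    print(posterior_normal_state(0.4, 0.3, 0.9, 0.5, 0.3))
```

If the supplied estimates are mutually inconsistent, the result can exceed 1, so in practice the inputs would need to come from the same observation history.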
The overhead of a remediation technique is defined as the ratio of the expected availability time to the actual useful computation time, calculated using (6).

$$\text{overhead} = \frac{\text{expected availability time}}{\text{useful computation time}} \tag{6}$$

Here the expected availability time is the duration for which the system is expected to be available as per the Service Level Agreement (SLA), e.g., an availability of 0.9999. The useful computation time is the time the system spends doing only useful computation, which is given in (7) and illustrated in Figure 2. The figure shows the execution timeline of the system, comprising the system uptime, the downtime d during which the system is unavailable, and the fault preparation time, i.e., the fraction of the uptime that the system spends preparing to handle faults; e.g., for checkpointing it involves installing the checkpoint, and for migration it is the copy time from source to destination.

Figure 2. Execution time line of a system
The useful computation time over the entire lifetime of a system is calculated as the availability time minus the preparation time, weighted by the probability density function of faults and integrated over the lifetime of the system. The lower bound of the integral is the preparation time, since during that time the system prepares for faults and does no useful computation, and the upper bound is the entire lifetime of the system, i.e., ∞.
The integral in (7) cannot be solved analytically because the integral of 1/t from 1 to ∞ cannot be expressed in terms of polynomials [13]. Hence a first-order approximation is used, replacing the denominator t with MTBF and d with D, the total downtime of the system [14]. Considering an exponential probability distribution for the fault rate [15], with probability density function $p(t) = \lambda e^{-\lambda t}$ where $\lambda = 1/\mathrm{MTBF}$, and solving the resulting integral gives (8).
Thus (8) gives the actual available time excluding the time the system spends preparing for fault handling. The preparation times for the different remediation techniques are given in Table 2; preparation time is effectively wasted time, as it does not contribute to system throughput. Given the expected availability time and the useful computation time, the overhead is then calculated using (6). Using the overhead, the application priority, the fault severity and the posterior probability of a stable state, the utility of the remedial technique is computed using (4).

EXPERIMENT AND PERFORMANCE EVALUATION
The measurements for downtime, MTBF, availability and remediation preparation time given in Table 2 are based on the work in [16][17][18] and are used as base values for the experiments in this work. The experimental setup consists of an Intel Xeon server with dual 6-core 2.4 GHz processors and 128 GB RAM, configured with the OpenStack Pike [19] cloud platform and hosting three VMs. Resource insufficiency anomalies such as CPU, memory and disk crunch [20] were injected by running workloads. The Nagios monitoring software is configured to monitor the CPU utilization, memory utilization and disk utilization metrics of the three virtual machines. An ensemble forecasting module is developed to forecast future resource usage values so that errors are identified before they actually occur and proactive actions can be initiated. These values are then converted to high, low and medium values by the fuzzy module. An expert system is designed and configured using PyKE 1.1; it receives the fault information from each VM, computes the utility values and triggers the corresponding remedial action. Table 3 shows the utility values calculated by the expert system for all the candidate remedial techniques for a VM resource insufficiency fault, and the same is represented in Figure 3. It is observed that the process migration technique, which has the highest utility value, is selected as the best remedial technique by the expert system. Process migration involves migrating one or more processes to another VM to reduce the resource crunch on the VM experiencing the fault. The utility values calculated by the proposed expert system for all the candidate remediation techniques for a system fault are shown in Table 4 and represented in Figure 4. It is observed that the VM restart technique, which has the highest utility value, is selected as the best remedial technique by the expert system. Restarting the VM restores the system to a stable state.
Table 5 shows the utility values calculated by the expert system for all candidate remedial techniques for an application fault, and the same is represented in Figure 5. It is observed that the software rejuvenation technique, which has the highest utility value, is selected as the best remedial technique by the expert system. Handling faults proactively is expected to improve the availability α of the system, because it reduces the number of failures, thus decreasing the failure rate λ and increasing the MTBF. The improved availability of the system can be given as in (9).
$$\alpha_p = \frac{\mathrm{MTBF}_p}{\mathrm{MTBF}_p + \mathrm{MTTR}_p}, \qquad \mathrm{MTBF}_p = \frac{1}{\lambda_p} \tag{9}$$

where $\alpha_p$, $\mathrm{MTBF}_p$, $\mathrm{MTTR}_p$ and $\lambda_p$ are the availability, MTBF, MTTR and failure rate of the proposed proactive fault remediation system. The efficiency of a system is defined as the percentage of faults it is able to handle proactively. For example, a 10% efficient system is able to prevent or proactively handle 10% of the faults that occur. In such a system the number of faults is reduced by 10%, which reduces the failure rate and increases the MTBF accordingly. To analyse the effectiveness of the proposed system, an exponentially distributed random fault rate is generated and the useful computation time given by (8) is calculated for a system with no proactive fault handling, i.e., a reactive fault handling system, and for systems with 10%, 40% and 60% efficiency. The useful computation time is plotted against the fault rate in Figure 6. It is observed that a proactive system has more useful computation time than a reactive fault management system, and that the useful computation time increases as the efficiency of the proactive system increases and more faults are handled proactively.
Similarly, the overhead given by (6) is calculated for an exponentially distributed random fault rate for the reactive system and for proactive systems with different efficiencies. The overhead values are plotted against the fault rate in Figure 7. It is observed that the overhead increases as the fault rate increases, and that the overhead of the reactive fault management system is higher than that of the proactive system; the overhead also decreases as the efficiency of the system increases, because the system is able to prevent faults from occurring. Finally, the MTBF and availability of the system are calculated using (9) for an exponentially distributed random fault rate. The availability values are plotted against increasing fault rate in Figure 8. It is observed that the availability of the proactive system is higher than that of the reactive fault management system and that availability increases with the efficiency of the system. Based on the experiments and performance evaluation discussed, it is observed that the proposed expert system improves the availability of the system over a reactive fault management system.
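The effect of efficiency on availability described in this section can be sketched numerically. The Python snippet below assumes that an efficiency e removes a fraction e of the faults, so the failure rate becomes (1 − e)·λ, that MTBF is the reciprocal of the failure rate, and that MTTR is unchanged by proactive handling; the MTTR assumption, the function name and all numbers are hypothetical and do not reproduce the values behind Figures 6 to 8.

```python
def proactive_availability(base_fault_rate: float, mttr: float, efficiency: float) -> float:
    """Availability when a fraction `efficiency` of faults is handled proactively."""
    if base_fault_rate <= 0 or mttr < 0:
        raise ValueError("fault rate must be positive and MTTR non-negative")
    if not 0.0 <= efficiency < 1.0:
        raise ValueError("efficiency must be in [0, 1)")
    reduced_rate = base_fault_rate * (1.0 - efficiency)  # fewer faults reach failure
    mtbf = 1.0 / reduced_rate                            # MTBF grows as the rate drops
    return mtbf / (mtbf + mttr)


if __name__ == "__main__":
    # Hypothetical base fault rate (per hour) and repair time (hours).
    fault_rate, mttr = 0.01, 2.0
    for efficiency in (0.0, 0.1, 0.4, 0.6):  # 0.0 corresponds to the reactive system
        value = proactive_availability(fault_rate, mttr, efficiency)
        print(f"efficiency={efficiency:.0%} availability={value:.6f}")
```

Under these assumptions availability rises monotonically with efficiency, which matches the qualitative trend reported above.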

RELATED WORK
This section provides a summary of previous related work. A knowledge management system [21] was designed for preventing SLA violations. The approach in [21] first formulates the behavior of a knowledge management system for preventing SLA violations and then evaluates different techniques to model that behavior, such as situation calculus and case-based reasoning. Case-based reasoning was found to be effective and was designed and tested. In case-based reasoning, the corresponding remediation approach is initiated for each case. An approach combining SLA prediction and cross-layer adaptation was designed to prevent SLA violations [22]. The architecture of the proposed concept is given, but no details are provided on the implementation techniques used or on experimental results.
A study [23] on the actions to be taken upon SLA violation defines five escalation levels: VM reconfiguration, application migration, VM migration, switching physical machines on or off, and outsourcing to another cloud provider. Depending on the level of SLA violation, a suitable technique is initiated, starting from the lowest level and moving to the highest. A middle layer [24] is defined to achieve fault tolerance and can tolerate node failure. It uses techniques such as replication and checkpointing for fault tolerance. CloudPD [25] is a cloud problem determination and diagnosis system that uses an online learning approach to deal with frequent reconfiguration and other high-tendency faults. It diagnoses anomalies based on pre-computed fault signatures and allows remediation to be integrated in an automated fashion. A proof-of-concept fuzzy control module for scaling resources up and down without violating the SLA is proposed in [26]. It uses imprecise information from the system administrator in the form of non-numeric linguistic variables, for example moderate, high and low. These fuzzy inputs are used in a fuzzy control system that uses expert knowledge to infer a fuzzy output. After defuzzification the output is a crisp value that controls the overall scaling of the system. PREPARE [27] uses Pearson correlation to detect anomalous attributes and a 2-dependent Markov chain model with a tree-augmented Bayesian network model to predict VM performance anomalies. It further uses prevention schemes such as elastic resource scaling and VM live migration. The work in [28] selects the remediation technique based on wasted time. It combines remediation techniques such as preventive checkpoint and rollback, regular checkpoint and rollback, preventive migration with regular checkpointing, and preventive rejuvenation with regular checkpointing. It further develops performance models for calculating wasted time and uses them to select the remedial technique with the least wasted time. The model does not assume any failure distribution.
The proposed work complements these approaches by developing an analytical model that selects the best technique by considering the overhead and impact of each remediation technique.

CONCLUSION
Fault remediation in the cloud environment is an important but relatively unexplored research area. The proposed fault remediation system uses an analytical approach to rank remediation techniques based on overhead and impact parameters. An expert system is then implemented that maps faults to remedies and selects the best remediation technique based on the analytical model developed. The simulation results show that the system is able to rank and select the best technique for the given scenarios. Further, the efficiency of the system is analyzed with respect to overhead and availability. The analysis shows that the availability of the proposed system is better than that of a reactive fault management system while its overhead is lower. Thus the fault remediation system improves availability and reduces downtime by handling faults proactively.