A simplified predictive framework for cost evaluation to fault assessment using machine learning

ABSTRACT


INTRODUCTION
The advancements in computing methods, embedded systems, communication standards, and artificial intelligence (AI) have disrupted the whole ecosystem of conventional approaches within the software scope [1], [2]. However, the software development process always aims to meet quality and safety requirements against all software vulnerabilities [3], [4]. The faults that may cause a functional error or vulnerability must be detected and removed. However, this can happen only when the faults are localized. Localizing faults in any system, including software, is the most expensive and challenging requirement before handling the faults [5]. Recently, many novel and fundamental approaches have been proposed for fault localization, providing new dimensions and scope [6]-[13]. Fault localization techniques (FLT) are broadly classified into two classes: i) static FLT and ii) dynamic FLT [6]. Static FLT, in the context of software programs, checks for bugs or defects in the source code against standard templates of the program models [14]. Dynamic FLT, on the other hand, works on the principle that it has no prior knowledge of the program and tags only

ISSN: 2088-8708  A simplified predictive framework for cost evaluation to fault assessment using machine … (Deepti Rai)

(SVR), and ANN consider optimization for the learning process, which targets reducing the training error to a greater extent, an aspect that has not been studied much in the past; iv) the response variable of thousands of lines of code (KLOC) indicates the possibility of faults and has a relationship with the cost modeling, which is also formulated in this proposed study; v) the study also constructed the ANN model with an optimized execution flow of the Adam optimizer, which resulted in smoother execution of ANN learning with considerably lower errors; it also shows that the training loss performance is significantly improved and gradually decreases in the case of ANN; and vi) the study provides a detailed experimental outcome that justifies the suitability of the three different learning models for predicting software faults and shows how accurately a model can learn.

METHOD
The proposed system adopts an analytical research method where the fundamental input for the learning model is the historic dataset. Let us consider dataset D, which is the pair value of {x, y}. As per the basic principle of learning a model, the dataset is divided into two segments represented by two variables. Here x represents the predictor variables, whereas y corresponds to the response class. Figure 1 shows that the {x, y} value pairs are further considered in the learning model, where the trainable parameters are m and c, which refer to the slope and y-intercept, respectively. The optimizer is incorporated to optimize the learning performance and aims to reduce the empirical error corresponding to the predicted response. The loss function in Figure 1 shows how the 'Yp' (predicted) and 'Ya' (actual) variables are used to compute the training loss, which is also referred to in the proposed correlative cost modeling of software fault prediction. The study considers this baseline strategy of the learning model as a foundation to develop the framework in which three ML-based approaches, LR, SVR, and ANN, are evaluated. The system model corresponding to the proposed framework is illustrated below.
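The Figure 1 learning loop described above can be sketched in a few lines of plain Python: a gradient-descent fit of the trainable slope m and y-intercept c that repeatedly reduces the squared loss between the predicted (Yp) and actual (Ya) responses. The toy data and learning rate below are illustrative assumptions, not values from the study.

```python
# Minimal sketch of the Figure 1 loop: learn slope m and intercept c by
# gradient descent on the mean squared loss between Yp = m*x + c and Ya = y.
# Data and hyperparameters are hypothetical.

def fit_line(xs, ys, lr=0.01, epochs=2000):
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradients of the mean squared loss with respect to m and c
        grad_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / n
        m -= lr * grad_m
        c -= lr * grad_c
    return m, c

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 6.9, 9.1]   # data lying roughly on y = 2x + 1
m, c = fit_line(xs, ys)
```

After enough epochs the parameters settle near the least-squares line, which is exactly the behavior the optimizer block of Figure 1 is meant to produce.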

LR learning model
In the first predictive model, fault prediction is realized as a regression problem. In this model design, it is assumed that the numeric output r is the sum of a deterministic function of the input and additive random noise (ε), an extra hidden variable that cannot be observed. This can be realized mathematically as (1), r = ƒ(x) + ε, where the unknown function ƒ(⋅) is to be approximated considering the estimator g(⋅). The values of x indicate the training data of software faults, and this supervised approach of ML fits a function to x to learn y as a function of x. Let us assume the input attributes of the predictor variable for the software fault data are x, represented as a vector x = [x_1, x_2, …, x_n] in (2). Each training sample for software fault data can also be represented as an ordered pair (x, y). If the training set consists of a total of S samples, (3) shows the representation of the ordered pairs as {(x^t, r^t), t = 1, …, S}.
here r^t ∈ ℝ and ƒ(⋅) is unknown, with random noise ε from (1). The software fault dataset contains an n-dimensional predictor variable, where n = 25. The fitted function can also be realized using (4) and (5); here w and w_0 are suitable numerical values representing the slope and intercept, respectively, of the simple linear model g(x) = wx + w_0 in (4). In (5), the software fault prediction problem is further generalized as a multiple linear regression problem, g(x) = w_0 + w_1 x_1 + ⋯ + w_n x_n, with different regression coefficients and independent variables. As (2) and (3) highlight, the LR learning model is considered a supervised learning problem. The function approximation also takes place considering the numerical adjustment of the weight factors w. In the context of ML for predicting a software fault, the task for the LR model is to formulate a mapping x → y. Machine learning in this software fault prediction context means that the model is defined with respect to a set of parameters, which can be expressed as (6).
In (6), y = g(x|θ) produces a number as the regression outcome from evaluating the model g(⋅). Here θ indicates the model parameters. The regression function g(⋅) is modeled in such a way that the ML program aims to optimize the parameters θ in the given function. The prime motive is to minimize the approximation error so that the estimated outcome becomes closer to the actual values given in the software fault training set. The prime aim is to construct a g(⋅) that can reduce the empirical error. If g(⋅) is linear, then it can be generalized using (7).
The linear model constructed for the prediction of the response r in reference to y shows that it estimates the response for software fault prediction, which can also be evaluated in the cost measure (C). Here the function is approximated by learning from the data, where the parameters w and w_0 learn from the data x. The values of w and w_0 minimize the empirical error in the training set, as expressed in (8).
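For the simple linear case, the error-minimizing w and w_0 have a standard closed form, the least-squares estimates w = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)² and w_0 = ȳ − w x̄. The sketch below illustrates this on hypothetical data; it is not the paper's implementation.

```python
# Closed-form least-squares estimate of slope w and intercept w0 for the
# simple linear model g(x) = w*x + w0. Toy data is hypothetical.

def least_squares(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    w = sxy / sxx
    w0 = my - w * mx
    return w, w0

w, w0 = least_squares([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
# w → 2.0, w0 → 0.0 (the data lies exactly on y = 2x)
```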
The design model of the presented LR is customized by applying the estimator g(x|θ) to ƒ(⋅). The estimator approximates the ƒ(⋅) response for software fault prediction in the measure of cost. Here the set of parameters θ is also defined for the measure of learning. It is assumed that the value of ε is zero-mean Gaussian with constant variance σ². The normalized value approximation for the random noise can then be represented as ε ~ N(0, σ²) for a normal distribution. The normal distribution here basically approximates the discrete distribution. The substitution of the estimator g(⋅) in place of ƒ(x) provides the probability of the output given the input, p(r|x) ~ N(g(x|θ), σ²), as in (9).

As shown in (9), the normal distribution computation over the x values of the software fault data considers the joint probability density. This indicates that in this linear model of software fault detection, zero-mean Gaussian noise is added. The LR model in this proposed framework inherits the standard and potential properties of the normal distribution to study the software fault data [24]. The study also considers maximum likelihood computation (ℒ) to learn the parameters defined in θ. The pairs {x^t, r^t}, t = 1, …, S, in the training set are drawn from the unknown probability density p(x, r), which can be mathematically expressed as (10),

p(x, r) = p(r|x)p(x) (10)

here p(x, r) represents the unknown joint probability density, measured as the product of the probability of the output given the input and p(x), the probability density of the input. The input x and the parameters θ in the model g(x|θ) also formulate the problem of multiple linear regression. The learning algorithm is further designed considering the training set of fault data drawn from p(x, r). The approximation error is computed considering a loss function of r and g(x|θ). The approximation error for the loss function (E) is designed using (11). When learning the class corresponding to a software fault, the LR model uses the square of the difference, consistent with (9). The LR model in this proposed study also applies an optimization procedure to extract θ* = arg min_θ E(θ|X), which minimizes the error of the predicted or approximated response of software faults, as represented in (12).
The decrease of the error E also shows how accurately the approximating function g(x|θ) is learned with respect to the intercept and slope of the software fault data. The response class of KLOC is also considered in the dataset. Estimation of LR model learning and prediction accuracy is further performed considering a set of performance metrics: mean squared error (MSE), root mean square error (RMSE), mean absolute error (MAE), and mean magnitude of relative error (MMRE).
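The four metrics can be sketched as follows. MMRE is assumed here to take its common software-estimation definition, the mean of |actual − predicted| / actual, since the paper does not spell out its formula; the sample values are hypothetical.

```python
import math

# Sketch of the four evaluation metrics used in the study. The MMRE formula
# (mean of |actual - predicted| / actual) is an assumption based on common
# usage in software cost estimation. Toy values are hypothetical.

def metrics(actual, predicted):
    n = len(actual)
    errs = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errs) / n
    rmse = math.sqrt(mse)
    mae = sum(abs(e) for e in errs) / n
    mmre = sum(abs(a - p) / a for a, p in zip(actual, predicted)) / n
    return mse, rmse, mae, mmre

mse, rmse, mae, mmre = metrics([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
```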

SVR learning model
The core idea of the study also incorporates an SVR-based learning model for predicting software fault data with respect to the cost measure of KLOC. The SV algorithm follows a nonlinear generalization metric. The idea of SVR is to minimize a bound on the generalization error instead of the observed training error to achieve generalized performance in the context of software fault prediction. It is functionally modeled based on the computing procedure of LR in a high-dimensional feature space, into which the input data are mapped via a nonlinear function. The core design of SVR in the proposed framework of the correlative cost model of software fault prediction considers input training data in the form {(x_1, y_1), …, (x_n, y_n)} ⊂ X × ℝ, where X = ℝ^d represents the space of fault data patterns for the input instances. In ε-SVR, the prime target is to compute a function ƒ(x) that has at most ε deviation from the actually obtained software fault responses y for all the training data (d). The analysis in the context of SVR does not care about errors as long as the error < ε; however, it does not permit any deviation larger than ε. The linear parameterization of ƒ(x) in SVR can be described using (13). SVR constructs this software fault prediction problem as a convex optimization problem that aims to minimize ||w||² = ⟨w, w⟩. The convex optimization problem in the proposed software fault prediction context can be formulated using (14).
The presented study evaluates the SVR model on the learning process from the software fault dataset via ⟨x • w⟩, where x represents the predictor variables. The core objective of the function ƒ(⋅) is to approximate the value pairs {x: y} within a considerable margin ε. The model also checks whether the convex optimization problem is feasible for all the inputs {x: y} from the software fault dataset. To deal with otherwise infeasible constraints of the optimization, slack variables ξ, ξ* are introduced in SVR. These approximate the trade-off between the flatness of ƒ(⋅) and the amount up to which deviations larger than ε are tolerated.
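The ε-insensitive behavior described above, where errors inside the ε-tube are ignored and larger deviations are penalized, can be illustrated with a short sketch (hypothetical values, not the study's data):

```python
# Sketch of the epsilon-insensitive loss at the heart of ε-SVR: deviations
# smaller than eps cost nothing; larger deviations are penalized linearly.

def eps_insensitive_loss(actual, predicted, eps=0.5):
    return sum(max(0.0, abs(a - p) - eps) for a, p in zip(actual, predicted))

# The first two predictions sit inside the eps-tube and contribute zero loss;
# only the third (deviation 1.0 > 0.5) is penalized.
loss = eps_insensitive_loss([1.0, 2.0, 3.0], [1.2, 2.0, 4.0], eps=0.5)
```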

Learning model for ANN
The study also considered the significant aspects of ANN and their AI origins to address this software fault prediction problem. In the ANN model, input information is propagated among a network of nodes. The nodes in this model interact with each other mathematically in a way the user does not observe. Eventually, the model computes an output for the value pair {x: y}, formulating a relationship between x and y. This further maps the expected macroscopic input-output pattern of the relationship. The interaction between nodes is adjusted until the model finds the desired input-output outcome. The prime construct of ANN considers three distinct layers with interconnections among nodes: the input layer receives information from external sources and relays it to the ANN for model processing; the hidden layer processes the information received from the input layer and passes it to the output layer; and the output layer receives the processed information and sends it to the external receptor. In the proposed study, the ANN model retains information through the connections of nodes with their neighbors and the magnitudes of the signals. The ANN in the presented study addresses the problem of imbalanced data and considers processing noisy, incomplete, and inconsistent information. Here each node encodes a micro-feature of the input-output pattern. One novel aspect of this model is that, unlike other computational techniques, this approach considers micro-feature computation.
Here each node acts as a processing element (PE) to deal with the software fault data {x: y}. Most of the calculations are performed in each PE of the ANN. The j-th processing element considers an input vector x with components x_1 → x_n and performs processing, yielding the output y as a function ƒ(x, w). The ANN model in the proposed system of software fault prediction considers the mathematical expression in (15).
Here every input is multiplied by its corresponding weight factor, and the weighted inputs are considered in each node for further calculations. The threshold θ_j for the j-th node controls the activation of the node. Similarly, for all n nodes, the total activation can be computed using (16).
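A single processing element of (15) can be sketched as a weighted sum of inputs minus the node threshold θ_j, passed through an activation function; the choice of sigmoid, and the weights and inputs below, are illustrative assumptions:

```python
import math

# Sketch of one ANN processing element: weighted sum of inputs minus a node
# threshold, passed through a sigmoid activation. Weights, threshold, and
# inputs are hypothetical.

def node_output(x, w, theta):
    activation = sum(xi * wi for xi, wi in zip(x, w)) - theta
    return 1.0 / (1.0 + math.exp(-activation))  # sigmoid

y = node_output([1.0, 0.5], [0.4, -0.2], theta=0.3)
# net activation is 0.4 - 0.1 - 0.3 = 0, so the sigmoid output is 0.5
```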
In the proposed system of software fault prediction, the study considers a sequential model of ANN, constructs its layers, and prepares the input layers for the training data. The proposed execution modeling of the ANN incorporates the Adam optimizer to optimize the execution flow of software fault prediction and also aims to reduce the training error for software fault detection. The study has found that the training loss drops significantly over incremental epochs, ensuring that the optimized model attains better learning accuracy for software fault prediction. The above-mentioned models of LR, SVR, and ANN consider the training data [x_train, y_train] to fit the models; further, with x_test, the models perform prediction of the software fault data. The next section discusses the experimental outcome obtained from simulating these ML-based software fault prediction approaches.
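The Adam update rule that drives the training loss down over epochs can be sketched in plain Python. For brevity this fits a single linear unit rather than the paper's full sequential ANN, and the data and hyperparameters are illustrative:

```python
# Minimal sketch of the Adam optimizer (moment estimates plus bias
# correction) applied to a single linear unit y = m*x + c. This is not the
# paper's sequential ANN; data and hyperparameters are illustrative.

def adam_fit(xs, ys, lr=0.02, beta1=0.9, beta2=0.999, eps=1e-8, epochs=5000):
    params = [0.0, 0.0]   # [m, c]
    m_t = [0.0, 0.0]      # first-moment (mean) estimates
    v_t = [0.0, 0.0]      # second-moment (uncentered variance) estimates
    n = len(xs)
    for t in range(1, epochs + 1):
        # gradients of the mean squared loss w.r.t. m and c
        g = [sum(2 * (params[0] * x + params[1] - y) * x for x, y in zip(xs, ys)) / n,
             sum(2 * (params[0] * x + params[1] - y) for x, y in zip(xs, ys)) / n]
        for i in range(2):
            m_t[i] = beta1 * m_t[i] + (1 - beta1) * g[i]
            v_t[i] = beta2 * v_t[i] + (1 - beta2) * g[i] ** 2
            m_hat = m_t[i] / (1 - beta1 ** t)   # bias-corrected first moment
            v_hat = v_t[i] / (1 - beta2 ** t)   # bias-corrected second moment
            params[i] -= lr * m_hat / (v_hat ** 0.5 + eps)
    return params

m, c = adam_fit([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])  # data on y = 2x + 1
```

The adaptive per-parameter step sizes are what give the smooth, steadily decreasing training-loss curves the study reports for the Adam-optimized ANN.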

RESULTS
This section illustrates the experimental outcome obtained after simulating the proposed correlative cost framework for software fault prediction in a numerical simulation environment. The study considers the interactive scientific computing tool Jupyter Notebook to realize these models, where the framework initially extracts the data frame (d) from the fault data F. The system configuration considers a 64-bit Windows operating system supported by an Intel i7 processor (CPU @ 2.60 GHz, 2.59 GHz) with 16.0 GB internal memory.

The Jupyter Notebook application uses the localhost:8888 server and enables the kernel processes. The original dataset comes in the "attribute relation file format (ARFF)," which is suitable for the machine learning application "WEKA," and this dataset is quite suitable for the regression task. It consists of 10885 different projects with varied lines of code (LoC) ranging between [Max, Min]. The computation of the data frame d can be visualized in Table 1. Table 1 shows that the system initially computes the data frame considering the input dataset [32] by reading and importing the dataset into the system's scope of execution. The table is constructed as a matrix of 5 rows × 26 columns. Further, the study also computes the definition of the data frame, where most of the predictor variables are found to be of type float64. In contrast, the variables record number and model are treated as objects. The model considered in the dataset is referred to as 'cocomoII.' Further, the framework also performs statistical computation to visualize the descriptive statistics from the data. The study considers the described count measure for the software fault data, which is further treated as training samples in the form of variables, followed by mean computation for the software fault training data from the statistical point of view. The preliminary statistical evaluation shows that the mean values are quite high in the case of docu and ACT_EFFORT. A similar computation is performed for the standard deviation, which also exhibited higher scores for the predictor variables docu and ACT_EFFORT. Further, min and max computation is carried out for the software fault training variables and tabulated in Table 2. The study further compares the training outcome for LR, SVR, and ANN with respect to the standard performance metrics MAE, MSE, RMSE, and MMRE. The computational analysis of MSE and
RMSE is illustrated in Figure 2. The prime novelty of the proposed outcome is manifold: i) the ANN model exhibits approximately 98.97% reduced MSE, 97% reduced RMSE, 98.78% reduced MAE, and 54.6% reduced MMRE compared to the conventional LR and SVR models; and ii) it was also found that the overall processing time of ANN is 0.4337 s, while that of the LR and SVR schemes is approximately 1.107 s and 0.8336 s, respectively. On the basis of this outcome, it can be stated that ANN offers better predictive performance with higher accuracy for fault prediction in software design. Hence, a better form of cost-effective predictive modeling is presented in the proposed scheme.

CONCLUSION
The study introduces a numerical framework of correlative cost modeling for software fault prediction considering three popular learning models: LR, SVR, and ANN. The proposed system model considers a standard software fault dataset and evaluates these three models for prediction, considering numerical modeling and implementation. This work's novelty is that it addresses the data imbalance problem and optimizes the performance of the learning models to minimize the empirical error of the predicted response class of KLOC. The study considers KLOC as a response variable to predict the possibility of faults in the code. The experimental outcome clearly shows that ANN outperforms the other models in learning accuracy among LR, SVR, and ANN. It also shows that the Adam optimizer in ANN not only exhibits considerable execution performance but also ensures very negligible training loss and learning errors in the measures of MAE, MSE, RMSE, and MMRE. This indicates that the ANN model has been trained effectively with the training data, maximizing the possibility of accurate fault prediction from the considered dataset. Another interesting point in this implementation is that, in order to carry out analysis of fault tolerance in software engineering, the proposed machine learning based approach offers a cost-effective solution and does not demand the adoption of complex new and evolving deep learning schemes. This approach can help companies maximize their profit by minimizing the cost of production and deployment of lines of code in software programs. Future work on the proposed scheme will be carried out toward optimizing more software metrics considering a greater number of risk factors and uncertainties.

Figure 1. Architectural model for the learning mechanism from the dataset


ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 7027-7036

r = ƒ(x) + ε (1)

Here KLOC, the indicator (L), i.e., thousands of lines of code, refers to how large a computer program is. More lines of code indicate a greater possibility of fault occurrence. The presented approach to learning considers y = g(x|θ) and divides the training dataset of software faults accordingly. The customized function of the LR model fits the training data considering y = g(x|θ); here the training of the model takes place with respect to x, y. A closer observation of the predictor variables shows that during the splitting of x and r from the software fault data, x_1 : x_24 become the predictors, while (r ← x_25) becomes the response variable from the data frame d ∈ F (which indicates the KLOC value). The study invokes a customized functional module Θsplit(x), which considers the input x = x_1 → x_24 and y = x_25 with test size (tsize = 0.2) and random state (rstate) and computes the data [x_train, y_train]. The fitness of the model g(x|θ) is further evaluated with respect to [x_train, y_train], which ensures learning from the data. Finally, the software fault prediction model generates the response y considering the test data (x_test).
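The Θsplit(x) module described above can be sketched as follows, assuming a plain row-wise split with the stated test size of 0.2 and a fixed random state; the row values and column layout below are hypothetical:

```python
import random

# Sketch of a Θsplit-style train/test split: the last column plays the role
# of the response x25 (KLOC), the rest are predictors; tsize and rstate
# mirror the tsize = 0.2 and random-state inputs described in the text.

def theta_split(rows, tsize=0.2, rstate=42):
    rng = random.Random(rstate)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(rows) * tsize)
    test_idx = set(idx[:n_test])
    x_train, y_train, x_test, y_test = [], [], [], []
    for i, row in enumerate(rows):
        x, y = row[:-1], row[-1]   # predictors vs. response column
        if i in test_idx:
            x_test.append(x); y_test.append(y)
        else:
            x_train.append(x); y_train.append(y)
    return x_train, y_train, x_test, y_test

rows = [[float(i), float(i * 2), float(i * 3)] for i in range(10)]
x_train, y_train, x_test, y_test = theta_split(rows)
```

With ten rows and tsize = 0.2, two rows land in the test partition and eight remain for training, matching the 80/20 regime used by the framework.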

Table 1. Computation of data frame d for {x: y}

Table 2. Descriptive statistics of {x: y}