Projection pursuit Random Forest using discriminant feature analysis model for churners prediction in telecom industry

ABSTRACT


INTRODUCTION
The telecom industry is a highly technological sector that has developed tremendously over the past two decades as a result of the emergence and commercial success of both mobile telecommunication and the internet [1]. Customer churn, or customer attrition, is a great challenge for many telecom companies. It happens when a customer ends a subscription and switches to a competitor. Many factors affect a customer's decision to switch; in general, they relate to high cost, poor customer service, fraud and privacy concerns [2,3]. Customer churn causes serious profit loss when it exceeds certain limits. Moreover, companies realize that attracting new customers is much more expensive than retaining existing ones. The first and foremost step in curtailing churn and establishing the loyalty of existing customers is to understand the reasons for churning. In this situation, churn prediction is a useful tool for forecasting which customers are at risk of churning, and it is the main remedy for overcoming the business hazards of churn and retaining customers in the company [4]. Customer Churn Prediction (CCP) has been raised as a key issue in many fields, such as telecom providers, credit cards, internet service providers, electronic commerce, retail marketing, newspaper publishing, banking and financial services [5]. CCP in telecommunication companies has become an increasingly popular research issue in recent years; telecom providers therefore widely use strategies that identify potential churners from their historical information and prior behavior and offer them services to persuade them to stay. Long-term customers are also more profitable for service providers, since they are more likely to buy additional products and to spread their satisfaction within their circle, which indirectly attracts more customers [6].
Stockholders are forced to search for alternative approaches that use machine learning techniques and statistical tools to recognize the causes of churn in advance and to respond immediately. This is possible if the historical data of potential customers are analyzed systematically [7]. Fortunately, telecom companies produce and preserve a large volume of data. These include non-relational data, i.e., billing information, demographics, customer care and customer behavior, and relational data, i.e., Call Detail Records (CDR) and network data. Moreover, prediction methods do not use all the features of the telecom database; only the relevant features that really contribute to CCP are used in data mining (DM) techniques [8].
Statistical learning studies methods of approximating a functional dependency from a given collection of data. It covers significant issues in classical statistics such as discriminant analysis, regression methods, and the density estimation problem [9]. Statistical learning is a kind of statistical inference, also called inductive statistics. Recently, statistical learning methods such as Support Vector Machines (SVM) and Linear Discriminant Analysis (LDA) have played an important role in describing the differences between a reference collection of patterns and the population under exploration [10].
The main contribution of this research is a new ensemble learning method for churn prediction. It is based on a Random Forest built from oblique trees that use an optimal linear combination of randomly chosen predictors, which increases predictive performance when the separating hyperplane between classes lies in a linear combination of variables. The proposed method is called Projection Pursuit Random Forest (PPForest). In addition, a visualization tool for the constructed PPForest is used and compared with the Random Forest graph in order to understand how the PPForest model summarizes datasets.
The main difference from the conventional Random Forest approach is that the oblique partitions of variables are not selected randomly. Instead, the linear combination in each tree is calculated by optimizing a projection pursuit index based on Linear Discriminant Analysis (LDA) or a Support Vector Machine (SVM) to find the projection of the variables that best splits the classes, taking into account the correlation between the target variable and the other dataset variables. PPForest outperforms a traditional Random Forest when separations between groups occur in linear combinations of variables.
PPForest uses the projection pursuit tree (PPtree) as an oblique model for classification problems in which the response variable is categorical and the features are quantitative; the tree is built from the available variables to enhance performance in multi-class problems and in the presence of nonlinear separations [11]. Two projection pursuit indices, LDA and SVM, are used in this research. PPtree is based on optimizing the projection pursuit index to find low-dimensional projections that separate the classes. At each node, PPtree uses the best projection to separate two groups of classes using the LDA or SVM projection pursuit index with class information. Each class is assigned to only one final node, so the depth of the oblique PPtree cannot be greater than the number of classes. PPtree therefore constructs a simple and more understandable tree for classification. The projection coefficients of each node represent the importance of the variables for the class separation at that node. To enhance the accuracy of the ensemble PPForest method and improve its generalization, a novel weak tree remover is used to discard trees with poor out-of-bag (OOB) performance and to tune the PPtree. The Chi-square method is used for feature selection to test whether running the PPForest algorithm on a relatively small dataset could improve its performance [12]. After analyzing the outcome of the proposed method using classification performance metrics on Telecom datasets that differ in the number of observations and attributes, it is shown that the proposed ensemble method using PPForest with the LDA index gives robust results for the overall churner prediction system. Apart from computational complexity in terms of time and storage, there is no difference in churn classification output whether or not the feature selection method is used.
The paper is structured as follows. Section 1 presents the introduction and previous studies on customer churn prediction in the Telco sector. The methodology, model building, data preprocessing, chi-square test and executed methods are described in Section 2. Section 3 presents the experimental implementation and discusses the outcomes of the churn system. The PPForest graph and the Huber plot of the PPtree are visualized in Sections 4 and 5. Conclusions are given in Section 6.
Classifying churners and non-churners is regarded as a predominant problem for telecom providers; churn is defined as the loss of customers because they leave for competitors. Being able to classify customer churn in advance gives the Telco company valuable insight for retaining its customer base. A wide range of churn classification methods has been investigated in recent years. Most innovative models make use of state-of-the-art machine learning classifiers and have identified that the origins of customer churn relate to quality of service, demographic factors, customer satisfaction or dissatisfaction, and economic value factors.

RESEARCH METHOD
The objective of the proposed scheme is to build a classification model that labels each individual client in the Telecom datasets as a potential churner or non-churner. This procedure assists customer relationship management (CRM) in adopting the crucial retention policies that are likely to attract customers, and in targeting the customers who are most likely to churn and persuading them to remain. The input to the proposed customer churn prediction (CCP) model includes information from past calls for each mobile subscriber, together with all the individual and business information preserved by the telecom service provider. Once the prediction model has been fully trained on the training dataset, it must be able to predict churners in the test dataset. The proposed methodology for churner prediction is shown as a schematic diagram in Figure 1, and the steps are explained in detail in the following subsections.

Datasets
The practical part of the research is run on different Telecom datasets provided by various wireless Telco operators around the world. Table 2 summarizes the main characteristics of these datasets, i.e., name, number of observations, number of attributes, churn rate and missing value percentage. As can be seen from the table, the smallest dataset contains 2000 observations and the largest up to 71000 observations. These sizes allow each dataset to be split randomly into a training set (0.8) and a test set (0.2) when implementing the CCP methodology. The datasets also differ substantially in the number of attributes, which ranges from 14 up to 163. However, more attributes do not guarantee a better classification model; they do heavily increase the computational complexity required to run the experiments. The final performance of a classifier depends mainly on the feature engineering of the attributes, and not on the number of attributes available. Most of the data, however, are collected over a period of three to six months, with a churn flag indicating whether a customer churned in the month after the month following the period when the data was collected. The table also indicates the class distribution, which is heavily skewed for all datasets. The percentage of churners typically lies within a range of 1% to 5% of the entire customer base, depending on the length of the period in which churn is measured. The table also shows the missing value rate; the ambiguity introduced by missing values has a significant negative influence on the predictive accuracy of the CCP model.

Datasets pre-processing
Preprocessing is a data mining step that converts raw data into a comprehensible format. Real-world data are often incomplete, inconsistent, lacking in certain behaviors and patterns, and may contain many errors. Preprocessing is a proven way of solving these problems. Most telecom datasets come with a high proportion of missing values. Instead of removing variables and observations that have many missing values, another approach is to fill in the missing values. A variety of approaches, ranging from extremely simple to relatively complex, can be used for imputing missing features. This paper uses Predictive Mean Matching (PMM) as the main method for exploring and filling in missing values [20]. PMM is widely used for variable imputation and is an attractive way to do multiple imputation, especially for filling in quantitative variables with irregular distributions [21]. PMM is applied in two steps. First, a predictive mean function is estimated. Second, each record with a missing value is imputed by finding similar records in the dataset with a nearest-neighbor technique; the observed outcome value of the nearest neighbor is then used for imputation.
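The two PMM steps described above can be sketched as follows. This is a simplified, single-imputation illustration and not the implementation used in the paper: the function name `pmm_impute`, the ordinary least squares mean function, and the donor-pool size `k` are assumptions made for the example (full PMM also draws the regression coefficients from their posterior, which is omitted here).

```python
import numpy as np

def pmm_impute(x, y, k=5, seed=0):
    """Fill NaNs in y by a simplified predictive mean matching on predictors x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float).copy()
    if x.ndim == 1:
        x = x[:, None]
    design = np.column_stack([np.ones(len(y)), x])  # intercept + predictors
    observed = ~np.isnan(y)

    # Step 1: estimate the mean function on observed rows (OLS, an assumption).
    beta, *_ = np.linalg.lstsq(design[observed], y[observed], rcond=None)
    predicted = design @ beta

    # Step 2: for each missing row, take the k observed rows whose predicted
    # means are closest and copy the observed value of one of these donors.
    donor_pred = predicted[observed]
    donor_val = y[observed]
    for i in np.flatnonzero(~observed):
        nearest = np.argsort(np.abs(donor_pred - predicted[i]))[:k]
        y[i] = donor_val[rng.choice(nearest)]
    return y
```

Because the imputed value is always copied from an observed donor, the filled-in values stay within the range of real data, which is one reason PMM suits irregularly distributed quantitative variables.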

Features selection based on chi-square test
The most important step in data preprocessing is to identify the attributes that are genuinely relevant to the target variable, since not all attributes contribute well to the classifier. Because of the large scale of datasets in telecom provider services, feature selection has become essential to improve performance, make the CCP model easier to interpret, decrease overfitting, and eliminate variables that are redundant and contribute no information to the output of the model. Moreover, it reduces the size of the prediction problem and enables classification algorithms to yield outcomes faster [22]. The Chi-Square test is a nonparametric statistical method commonly used to determine whether there is a significant relationship between dataset features [23]. The procedure for measuring the independence between qualitative variables with the Chi-Square test is given in the following algorithm steps.
Chi-square test of independence pseudocode:
a. The degrees of freedom (DF) are computed as in Equation 1.
$DF = (r - 1)(c - 1)$ (1)
where r is the number of levels for one categorical variable, and c is the number of levels for the other categorical variable.
b. The expected frequency count is computed separately for each level of one categorical variable at each level of the other categorical variable, as in Equation 2.
$E_{r,c} = \frac{n_r \, n_c}{n}$ (2)
where $n_r$ is the total number of observations at level r of variable A, $n_c$ is the total number of observations at level c of variable B, and n is the total sample size.
The Chi-Square test statistic ($\chi^2$) is defined by Equation 3.
$\chi^2 = \sum_{r} \sum_{c} \frac{(O_{r,c} - E_{r,c})^2}{E_{r,c}}$ (3)
where $O_{r,c}$ is the observed frequency count at level r of variable A and level c of variable B, and $E_{r,c}$ is the expected frequency count at level r of variable A and level c of variable B.
c. The P-value is the probability of observing a sample statistic as extreme as the test statistic. Since the test statistic is a Chi-Square, a Chi-Square distribution calculator is used to assess the probability associated with the test statistic.
4. Interpret results: If the sample findings are unlikely given the null hypothesis, the procedure rejects the null hypothesis. Typically, this involves comparing the P-value to the significance level, and rejecting the null hypothesis when the P-value is smaller than the significance level [24].
Table 3 presents the Telecom datasets after imputation of the missing values and feature selection.
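As an illustration of the test steps above, the following sketch computes the degrees of freedom, expected counts, test statistic and P-value for a contingency table of a categorical feature against the churn flag. It uses only the standard library; the function names and the default significance level of 0.05 are assumptions for the example, and the P-value uses the closed-form survival function of the Chi-Square distribution for integer degrees of freedom.

```python
import math

def chi2_sf(x, dof):
    """P(Chi2_dof >= x) for integer dof, via the incomplete-gamma recurrence."""
    h = x / 2.0
    if dof % 2 == 0:
        a, q = 1.0, math.exp(-h)              # Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))   # Q(1/2, h)
    while a < dof / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_square_independence(table, significance=0.05):
    """table: list of rows of observed counts (feature levels x churn levels)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    p_value = chi2_sf(stat, dof)
    # Reject independence (keep the feature) when p_value < significance.
    return stat, dof, p_value, p_value < significance
```

For example, `chi_square_independence([[10, 20], [20, 10]])` gives one degree of freedom and rejects independence at the 0.05 level, so such a feature would be retained.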

Projection pursuit random forest (PPForest)
A Random Forest is an ensemble learning model built by bagging multiple independent decision trees with feature selection; it produces a classification by feeding the input to these internal trees and combining their outcomes by voting [16]. Most of the available traditional Random Forests are vulnerable to overfitting on some Telecom datasets and do not handle large numbers of redundant features. Choosing a random decision boundary is more efficient than the available techniques, which makes larger ensembles more feasible. Although this may seem to be a benefit, it shifts the computational complexity from training time to evaluation time, which is actually a disadvantage for most machine learning implementations [25]. Most available random forests separate the feature space by hyperplanes that are orthogonal to single feature axes; when the data are collinear with correlated features, hyperplanes that are oblique to the axes give better class separation. Trees that use linear combinations of variables in the node splitting procedure, including those with randomly generated coefficients, are known in the literature as oblique trees [26].
PPForest uses tree-structured approaches with projection pursuit indices for dimensionality reduction; they define hyperplanes that are oblique to the feature axes in decision trees that are trained independently, each with its own structure and properties. In other words, PPtree optimizes a projection pursuit index to obtain a low-dimensional projection that separates the classes; it addresses classification problems in which the response variable is categorical and the feature variables are quantitative [13]. At each split, a random sample of predictors is selected, and PPForest then uses an optimal linear combination of these variables, instead of only one variable, for the split when building the PPtree. This procedure may lead to a diversity of decision paths toward the final forest prediction, so it is desirable to understand and compare all decision paths in the context of the structure of all trees [17].
One important distinguishing feature of PPtree is that it always treats the problem as a two-class system. When there are more than two classes, the mean of each class is determined and the classes are reduced to two groups using the distances between the class means. For example, if there are five classes whose means on the projected variable are 2.1, 2.3, 2.5, 3.5 and 3.7, the classes with means 2.1, 2.3 and 2.5 are assigned to the first group and the classes with means 3.5 and 3.7 are assigned to the second group. In addition, at each node of the PPtree the projection coefficients denote the variable importance for the class split. This information is very helpful for selecting important variables with PPtree. PPForest outperforms a traditional random forest when the separating hyperplane between classes lies in a linear combination of randomly chosen predictors; this combination is found by searching for a low-dimensional projection that optimizes a projection pursuit index such as Linear Discriminant Analysis (LDA), Penalized Linear Discriminant Analysis (PDA), GINI, ENTROPY or Support Vector Machine (SVM) [27].
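The reduction to two groups in the five-class example can be sketched as below. The rule used here, splitting the ordered class means at the largest gap between neighbours, is an assumption that is consistent with the example; the text itself only states that distances between class means are used. The function name is hypothetical.

```python
def split_classes_by_mean_gap(class_means):
    """class_means: dict mapping class label -> mean of the projected data."""
    ordered = sorted(class_means.items(), key=lambda item: item[1])
    # Gap between each pair of neighbouring class means along the projection.
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(len(ordered) - 1)]
    cut = gaps.index(max(gaps)) + 1
    group_1 = [label for label, _ in ordered[:cut]]
    group_2 = [label for label, _ in ordered[cut:]]
    return group_1, group_2
```

With the means from the example, classes at 2.1, 2.3 and 2.5 fall in the first group and classes at 3.5 and 3.7 in the second, because the largest gap lies between 2.5 and 3.5.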
In the first step of the optimization problem, a projection pursuit index that uses the class information finds an optimal one-dimensional projection for separating all the data, and the training data are projected onto this projection line. The projected data are then used to redefine the problem as a two-class problem by comparing the class means, and a new label is assigned to each observation. The next step is to find an optimal one-dimensional projection that separates the two groups. All the steps are repeated until each group contains only one of the original classes. The tree grows according to these steps, which also determines the maximum depth of each tree in the forest [16].
Projection Pursuit Random Forest pseudocode:
1. Let d_N = {(x_i, y_i)}, i = 1, ..., N, be the training dataset, where x_i is a p-dimensional vector of explanatory variables and y_i represents the class information.
2. Bootstrap samples: samples of size n are drawn randomly with replacement from the original dataset to create k ensemble trees; they are used as training data, and the remaining samples are reserved as a test dataset for evaluating the proposed churn prediction model.
3. Grow the oblique tree (PPtree): for each bootstrap sample, build the oblique tree structure without pruning as detailed below:
a. Optimize a projection pursuit index, using LDA or SVM, to calculate an optimal one-dimensional projection α that splits all classes in the current bootstrap sample, and obtain the projected data z = α^T x.
b. On the projected data z, repeatedly reduce the number of groups until only two remain, by comparing the class means, and assign a new label G1 or G2 to each class.
c. On the projected data z, redo the projection pursuit with these new class labels (G1, G2) to find the one-dimensional projection α*, and assign a new group label G1* or G2* to each group, which can contain more than one original class.
d. Determine the decision rule c that best separates G1* and G2*, and keep both α* and c to provide the decision boundary for the node.
e. Split the data in each node of the tree into two sets using the new group labels G1* and G2*. If α*^T M1 < c, allocate G1* to the left node; otherwise allocate G2* to the right node, where M1 is the mean of G1*.
f. For each group, stop if it contains only one class; otherwise repeat the procedure. The splitting step is iterated until the last two classes are separated.
g. Each class is assigned to only one final node; the depth of the tree is at most the number of classes.
4. Repeat step 3 for k = 1, ..., B, where B is the number of trees in the forest.
5. Produce the ensemble of oblique trees and use the majority vote mechanism to predict the class for the training data.
6. Predict the class of each case not included in the bootstrap sample and compute the misclassification error and system accuracy.
7. Use the projection coefficients that give the dimension reduction at each node to measure variable importance.
8. Weak tree remover (classifier): To enhance the accuracy of the PPForest algorithm and improve the generalization of the model, the better-performing trees are selected on the basis of a lower out-of-bag classification error (OOB error); this is used to tune the model and discard the trees with the worst outcomes.
9. Apply the majority voting technique and evaluate the system based on the selected good oblique trees.
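Steps 3a to 3e of the pseudocode can be illustrated for a single node with the LDA index. The sketch below is not the authors' implementation and rests on several assumptions: the first discriminant direction is obtained as the leading eigenvector of S_W^{-1} S_B, the classes are regrouped at the largest gap between projected class means, the second projection is recomputed on the original features with the two group labels, and the cutoff c is taken as the midpoint of the two projected group means. The function names are hypothetical.

```python
import numpy as np

def lda_direction(X, labels):
    """Unit-length first LDA direction for the given labels (two or more)."""
    p = X.shape[1]
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((p, p))  # within-class scatter
    Sb = np.zeros((p, p))  # between-class scatter
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        Sw += (Xk - mk).T @ (Xk - mk)
        d = (mk - overall_mean)[:, None]
        Sb += len(Xk) * (d @ d.T)
    Sw += 1e-8 * np.eye(p)  # small ridge (assumption) to keep Sw invertible
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    alpha = np.real(evecs[:, np.argmax(np.real(evals))])
    return alpha / np.linalg.norm(alpha)

def pp_node_split(X, y):
    """Return (alpha_star, c, classes_left, classes_right) for one node."""
    # Step a: project all classes onto the optimal 1-D direction.
    z = X @ lda_direction(X, y)
    # Step b: reduce to two groups by comparing projected class means.
    classes = np.unique(y)
    means = np.array([z[y == k].mean() for k in classes])
    order = np.argsort(means)
    cut = int(np.argmax(np.diff(means[order]))) + 1
    g1, g2 = classes[order[:cut]], classes[order[cut:]]
    # Step c: redo the projection pursuit with the two group labels.
    in_g1 = np.isin(y, g1)
    alpha_star = lda_direction(X, in_g1.astype(int))
    z_star = X @ alpha_star
    if z_star[in_g1].mean() > z_star[~in_g1].mean():
        alpha_star, z_star = -alpha_star, -z_star  # keep G1* on the left
    # Step d: cutoff between the projected group means (midpoint rule).
    c = 0.5 * (z_star[in_g1].mean() + z_star[~in_g1].mean())
    # Step e: an observation x goes left when alpha_star @ x < c.
    return alpha_star, c, set(g1.tolist()), set(g2.tolist())
```

Applying `pp_node_split` recursively to each side until every group holds a single class would give one PPtree; a forest repeats this on bootstrap samples with a random subset of predictors at each node.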

Discriminant function analysis (DFA)
This section introduces and discusses some aspects of statistical learning theory that concern the discriminant methods SVM and LDA. DFA is a statistical procedure used to solve problems of statistical separation among distinct classes, under the assumption that the attributes are normally distributed with homogeneous variance-covariance matrices [28]. Linear models are easy to understand because the final output is a weighted sum of the input attributes 'xi'. The magnitude of the weight 'wi' shows the importance of the input, and its sign indicates whether the effect is positive or negative. Most functions are additive in that the output is the sum of the effects of several attributes, where the weights may be reinforcing or inhibiting [29].

Linear discriminant analysis (LDA)
This paper introduces an oblique tree algorithm for churner classification that can simultaneously shrink the tree size, mitigate the curse of dimensionality, enhance class separation, and improve the visualization of the tree data and structure. This is achieved by fitting a linear discriminant model to the data in each node of the tree using the discriminant function. LDA is a kind of Discriminant Function Analysis (DFA) that finds linear functions of the associated variables that lead to maximum discrimination between the group centroids [30]. LDA is a simple and mathematically robust technique frequently used in pattern recognition applications for dimension reduction and for classifying objects into mutually exclusive and exhaustive groups; it maximizes the inter-class scatter and minimizes the intra-class scatter concurrently, and finds appropriate projection pursuit directions for the classification problem [31].
Three steps are needed to perform the LDA calculation. The first step is to find the distance between the mean values of the different classes, which is called the between-class variance (SB). The second step is to calculate the distance between the mean and the samples of each class, which is known as the within-class variance (SW). The third step is to create the lower-dimensional space that maximizes the between-class variance and minimizes the within-class variance [32]. To achieve this goal, LDA makes class assignments by determining the linear transformation of the data in feature space that maximizes the ratio of the between-class variance to the within-class variance. In the two-class case, the maximum class separation is obtained with the weight vector 'w' and the intercept 'b' that express the linear transformation in Equation 4. The classes are well separated when, after the original data are projected, the distance between the two means is large and the spread of instances around each mean is small [33].
where $\Sigma^{-1}$ is the inverse of the variance-covariance matrix, $\mu_k$ represents the mean vector of class k, and $\pi_k$ is the prior probability of the kth class. To find the between-class variance (SB), the separation distance between different classes, denoted by $(m_i - m)$, is calculated as in Equation 5.
where $S_{B_i} = (\mu_i - \mu)(\mu_i - \mu)^T$ represents the separation distance between the mean of the ith class $\mu_i$ and the overall mean $\mu$. The total between-class matrix is then calculated by adding the between-class matrices $S_{B_i}$ of all classes. The total within-class matrix (SW) is calculated as in Equation 6.
In the above equations, $x_{ki}$ and $\bar{x}_k$ denote the ith training sample of class k and the corresponding class mean, respectively. After finding the between-class variance (SB) and the within-class variance (SW), the index matrix (Wlda) of the LDA technique can be calculated as in Equation 7. The solution of this equation can be obtained by finding the eigenvectors and the eigenvalues of $W = S_W^{-1} S_B$. The eigenvalues are scalar values that provide information about the LDA space, while the eigenvectors represent the directions of the new space [34]. The robustness of the eigenvectors reflects the ability to discriminate between different classes. The projection pursuit algorithm searches for a low-dimensional projection that optimizes the LDA index. The eigenvectors with the k highest eigenvalues are used to construct the lower-dimensional space of LDA, while the rest are neglected. The projection pursuit index is essential in projection pursuit LDA because the purpose of the method is achieved through its optimization. The basic task in projection pursuit is to determine which projections are interesting [35]. A distinct benefit of projection pursuit over other methods is that it can avoid the curse of dimensionality by focusing on low-dimensional projections and can ignore redundant features.
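The display forms of Equations 4 to 7 are not reproduced in this text. For orientation, the block below writes out standard textbook forms that are consistent with the symbol definitions given above. It is an assumption about how the equations read, not a transcription; in particular, the two-class form of Equation 4 and the unweighted sum in Equation 5 are choices made for this sketch (a common variant weights each $S_{B_i}$ by the class size $n_i$).

```latex
\begin{align}
y &= w^{T} x + b, \quad
  w = \Sigma^{-1}(\mu_1 - \mu_2), \quad
  b = -\tfrac{1}{2}(\mu_1 + \mu_2)^{T}\Sigma^{-1}(\mu_1 - \mu_2)
      + \ln\frac{\pi_1}{\pi_2} \tag{4} \\
(m_i - m)^2 &= (W^{T}\mu_i - W^{T}\mu)^2 = W^{T} S_{B_i} W, \quad
  S_B = \sum_{i} S_{B_i} \tag{5} \\
S_W &= \sum_{k} \sum_{i} (x_{ki} - \bar{x}_k)(x_{ki} - \bar{x}_k)^{T} \tag{6} \\
W_{lda} &= \arg\max_{W} \frac{\lvert W^{T} S_B W \rvert}{\lvert W^{T} S_W W \rvert}
  \;\;\Longrightarrow\;\; S_W^{-1} S_B\, w = \lambda w \tag{7}
\end{align}
```

Under these forms, an observation is assigned to class 1 when $y > 0$, and the columns of $W_{lda}$ are the eigenvectors of $S_W^{-1} S_B$ with the largest eigenvalues.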

Discriminant SVM framework
Recently, many researchers have favored SVM as a supervised machine learning algorithm. It has gained a good reputation in data mining because of its promising experimental performance and reasonable memory and time complexity, together with a strong mathematical basis, which make SVM a competitive classifier [36]. SVM is regarded as one of the most influential machine learning algorithms; it can be applied in a wide range of real-world applications and offers many benefits over traditional classification and regression techniques. One of the most significant advantages is that the solution depends on only a small subset of the original dataset, which gives SVM computational power, a robust mathematical foundation, and better generalization than other classification methods [37].
One remarkable characteristic of SVM and other kernel-based methods is that they work in high-dimensional spaces without significant computational cost or feature selection; SVM is robust against model error and is able to learn well with only a very small number of features. However, the main weakness of SVM is that the training phase is computationally expensive, because good estimates are needed for its parameters such as gamma, sigma, and degree. Moreover, it is highly dependent on the size of the original dataset [38]. The linear classifier is also known as an optimal hyperplane; the features are usually normalized to lie between -1 and 1 so that the samples can be divided into two distinct classes. Discriminant functions calculated by SVM are an efficient way of projecting multidimensional data in a direction perpendicular to the discriminating hyperplane. The projected data are then fitted to estimate and display the posterior probability densities and to enhance the classification accuracy of the discriminant function [39]. The basic idea of classification is to separate different samples into different classes; for binary classification, the prediction of the linear hyperplane is described by Equation 8.
where w and b are the weight vector and a constant, respectively, which are estimated from the dataset in n-dimensional space, and $w^T x$ is the inner product of the vectors $w \in R^n$ and $x \in R^n$ [40]. The dataset can be separated geometrically by a hyperplane. Two parallel hyperplanes should be built so that they are as far apart as possible and no samples lie between them; this arrangement is represented mathematically by Equation 9.
From this equation, it is straightforward to confirm that the normal distance (d) between these two hyperplanes is inversely related to the norm ||w||, as given by Equation 10.
The hyperplane can be represented mathematically by Equation 11.
where sign() is the sign function and αi are non-negative Lagrange coefficients calculated by solving a quadratic optimization problem with linear inequality constraints. The training observations 'xi' with non-zero αi lie on the boundary of the margin and are called support vectors (SV). The transformation should be chosen carefully so that the dot product of the transformed vectors leads to a kernel function [41].
LDA assumes that the data are normally distributed and that all classes have identical Gaussian distributions; if the classes have different covariance matrices, the method becomes quadratic rather than linear discriminant analysis [42]. SVM in its hard-margin form, by contrast, assumes that the classes are separable; its soft-margin form makes use of slack variables that permit a certain amount of overlap between the classes. SVM is a very flexible prediction method that makes no distributional assumptions about the input datasets. This flexibility, on the other hand, often makes the outcomes of an SVM classifier more difficult to interpret than those of LDA. Moreover, LDA uses the complete input dataset to estimate covariance matrices, which makes it somewhat sensitive to outliers, whereas the SVM optimization depends on the subset of the data located on the separating margin [43].
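Equations 8 to 11 are referenced above but their display forms are not reproduced in this text. The block below gives the standard SVM forms that match the surrounding descriptions. It is offered as an assumption about how the equations read; the kernel symbol $K$ and the labels $y_i \in \{-1, +1\}$ are notation introduced for this sketch.

```latex
\begin{align}
f(x) &= w^{T} x + b \tag{8} \\
w^{T} x_i + b &\ge +1 \ \text{ if } y_i = +1, \qquad
w^{T} x_i + b \le -1 \ \text{ if } y_i = -1 \tag{9} \\
d &= \frac{2}{\lVert w \rVert} \tag{10} \\
f(x) &= \operatorname{sign}\!\Big( \sum_{i} \alpha_i\, y_i\, K(x_i, x) + b \Big) \tag{11}
\end{align}
```

Maximizing the margin $d$ in Equation 10 is equivalent to minimizing $\lVert w \rVert$ subject to the constraints in Equation 9, which is why only the observations with non-zero $\alpha_i$ enter the sum in Equation 11.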

RESULTS AND ANALYSIS
A robust experimental setup, statistical tests and appropriate performance measures are essential for drawing correct conclusions. The telecom industry uses different types of measures to assess the performance of a churn prediction model.

Accuracy
Accuracy counts the correct predictions made by the model over all predictions made, i.e., how often the classifier is correct overall.

Precision (Confidence)
The proportion of predicted positive cases that are correctly recognized.

Sensitivity (Recall)
The proportion of actual positive cases that are correctly recognized.

Specificity
The proportion of actual negative cases that are correctly recognized.

Error rate (ER)
The number of all incorrect predictions divided by the total number of samples, i.e., the share of inaccurate predictions (misclassifications) made by the predictive method:
$ER = \frac{FP + FN}{TP + TN + FP + FN}$ (18)
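The confusion-matrix measures defined so far can be computed together, as in the following sketch. The function name and the dictionary keys are illustrative choices; churners are treated as the positive class, and no guard against empty denominators is included.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, specificity and error rate from counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,      # correct predictions over all
        "precision": tp / (tp + fp),        # predicted churners that churned
        "recall": tp / (tp + fn),           # actual churners that were found
        "specificity": tn / (tn + fp),      # actual non-churners kept as such
        "error_rate": (fp + fn) / total,    # equals 1 - accuracy
    }
```

On a skewed churn dataset, accuracy and error rate can look favorable even when recall is low, which is why the class-specific measures are reported alongside them.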

F-Score
Precision is valuable for assessing the performance of data mining classifiers, but it leaves out some facts and can therefore be misleading on its own. Recall is the ratio of true positive predictions to the total positive observations in the dataset. Compute the percent of churn rate that appropriately

Receiver operating characteristic curve (ROC)
The ROC curve depicts the relation between benefits and costs: it is a two-dimensional plot with linear x- and y-axes that summarizes the trade-off between recall and 1-specificity. The ROC chart is commonly used to visualize the performance of a churn classifier over all possible thresholds for assigning observations to a given class. It is generated by plotting the true positive rate, i.e., the proportion of churners correctly predicted as churners, against the false positive rate, i.e., the proportion of non-churners incorrectly classified as churners [45].

The area under the receiver operating characteristic curve (AUC)
The AUC measures the area below the ROC curve. The diagonal line represents a random process and has an AUC of 0.5, so the AUC of a good churn classifier should be much higher, preferably close to 1, since a value of 1 represents a perfect classifier. It represents a trade-off between the specificity and sensitivity of the model. The AUC can be interpreted as the probability that a random pair of churning and non-churning customers is correctly ranked, i.e., the positive instance receives a higher score than the negative instance. Although AUC is well known and widely used, it has been shown to be incoherent when comparing different methods [46].

Kolmogorov-Smirnov test (KS)
The KS test measures the performance of classification models by comparing a sample with a reference probability distribution. It measures the degree of separation between the desirable and undesirable distributions. The KS statistic gives the maximum distance between the ROC curve and the diagonal at a specific cut-off point. In most prediction models, the KS statistic falls in the range between zero and one; a higher value means the model is better at separating the positive from the negative class.
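Since the vertical distance between the ROC curve and the diagonal at a given cut-off is TPR − FPR, the statistic can be sketched as a maximum over thresholds. The helper name `ks_statistic` and the data are assumptions for illustration only.

```python
def ks_statistic(y_true, scores):
    """Maximum gap between TPR and FPR over all score thresholds."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = 0.0
    for thr in set(scores):
        tpr = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1) / n_pos
        fpr = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0) / n_neg
        best = max(best, abs(tpr - fpr))
    return best


# Hypothetical churn scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ks = ks_statistic(y_true, scores)
print(f"KS = {ks:.4f}")
```

A value of 0 means the score distributions of churners and non-churners coincide at every cut-off, while 1 means some cut-off separates them completely.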

H-measure
Some researchers have shown that a problem of the AUC is that it depends on the data mining method being evaluated, so its meaning differs with the classification method. The H-measure overcomes this variance of the AUC: it captures the performance advantage of the AUC but not its flaw, i.e. incoherent and potentially misleading results when ROC curves cross [47].

Lift measure
The effectiveness of the prediction model is expressed in the lift curve, which shows the fraction of all churners that can be caught when a designated fraction of subscribers is contacted. The lift is equal to the ratio between the sensitivity and the share of customers predicted as churners after applying the churn model to the testing dataset. Formula 20 represents the lift value [48].
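A minimal sketch of this ratio, assuming the subscribers with the highest churn scores are the ones contacted. The function `lift_at_fraction`, its parameters and the data are hypothetical and not taken from the paper.

```python
def lift_at_fraction(y_true, scores, fraction):
    """Lift when the top `fraction` of customers by score is contacted."""
    n = len(y_true)
    k = max(1, round(fraction * n))            # number of customers contacted
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    caught = sum(label for _, label in ranked[:k])
    sensitivity = caught / sum(y_true)         # share of all churners caught
    contacted_share = k / n                    # share predicted as churners
    return sensitivity / contacted_share


# Hypothetical data: 3 churners among 10 subscribers.
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
lift = lift_at_fraction(y_true, scores, 0.3)
print(f"lift at 30% = {lift:.3f}")
```

A lift of 1 means the model does no better than contacting subscribers at random; here two of the three churners fall in the top 30%, so the lift exceeds 2.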

Gini coefficient
An evaluator closely related to the AUC is the Gini coefficient, which equals twice the area between the ROC curve and the diagonal baseline, i.e. Gini = 2*AUC − 1. The Gini coefficient ranges from zero, when the ROC curve lies on the diagonal and the classification model performs no better than a random classifier, to one, which corresponds to the maximal ROC curve and perfect classification.
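The two end points of this range can be verified numerically. The sketch below estimates the area under a piecewise-linear ROC curve with the trapezoidal rule (an assumption about how the area is computed) and then applies Gini = 2*AUC − 1; both function names are hypothetical.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given as (FPR, TPR) points."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def gini_from_auc(auc):
    """Gini coefficient as twice the area between ROC curve and diagonal."""
    return 2.0 * auc - 1.0


diagonal = [(0.0, 0.0), (1.0, 1.0)]             # random classifier
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # perfect classifier
print(gini_from_auc(trapezoid_auc(diagonal)))
print(gini_from_auc(trapezoid_auc(perfect)))
```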

Out-of-bag error
For each oblique tree model in the bagging forest, some cases of the original data are not used. Predicting the response for these cases gives a better estimate of the model's error on future data. The OOB error rate is measured for each bagged model and used to provide the overall error of the ensemble [49].
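The mechanism can be sketched with a toy ensemble. Everything here is an assumption made for illustration: the base learner is a one-dimensional threshold stump rather than the paper's oblique PPtree, and the names `fit_stump` and `oob_error`, the data and the seed are hypothetical.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Toy base learner (NOT an oblique PPtree): threshold halfway
    between the two class means of a single feature."""
    x0 = [x for x, y in zip(xs, ys) if y == 0]
    x1 = [x for x, y in zip(xs, ys) if y == 1]
    if not x0 or not x1:            # bootstrap sample holds a single class
        only = ys[0]
        return lambda x: only
    m0, m1 = sum(x0) / len(x0), sum(x1) / len(x1)
    thr = (m0 + m1) / 2.0
    hi, lo = (1, 0) if m1 > m0 else (0, 1)
    return lambda x: hi if x > thr else lo


def oob_error(xs, ys, n_trees=25, seed=0):
    """Ensemble OOB error: each case is voted on only by the models
    whose bootstrap sample did not contain it."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        in_bag = set(idx)
        model = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:                      # out-of-bag case
                votes[i][model(xs[i])] += 1
    scored = [(i, v.most_common(1)[0][0]) for i, v in enumerate(votes) if v]
    wrong = sum(1 for i, pred in scored if pred != ys[i])
    return wrong / len(scored) if scored else float("nan")


# Hypothetical, well-separated one-feature data.
xs = list(range(10)) + list(range(100, 110))
ys = [0] * 10 + [1] * 10
err = oob_error(xs, ys)
print(f"OOB error = {err:.3f}")
```

Because each case is scored only by models that never saw it, the estimate behaves like a built-in validation set and needs no separate hold-out split.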

Cost
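Many of the above metrics attempt to take a balanced view between FP and FN. A principled way to achieve this is to introduce misclassification costs. Let c in [0, 1] denote the cost of misclassifying a category zero object as category one (FP), and 1 − c the cost of misclassifying a category one object as category zero (FN).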

Youden index
The Youden index is one of the well-known measures of diagnostic accuracy. It is a global measure of test performance, used to evaluate the overall discriminative power of a diagnostic procedure and to compare one test with other tests. It is calculated as sensitivity − (1 − specificity). It ranges from 0 for poor diagnostic accuracy to 1.0 for a perfect diagnostic test [50].

Different prediction modeling techniques yield different performance on the evaluation criteria, depending on the data and the telecom scenario. The best performance of the SVM model with the Radial Basis Function (RBF) kernel is achieved when the constant parameters of the SVM are chosen as shown in Table 4. Once the proposed PPForest model has been constructed, performance evaluators help to assess model accuracy, and the evaluation metrics are used to discriminate among model outcomes. In this section, the churner/non-churner prediction methodology is assessed by comparing the performance of two discriminant functions used in the construction of the PPtree. The first method uses SVM as the projection pursuit index in the construction of PPForest to differentiate between the churner and non-churner customer classes. The second method uses LDA to achieve linear splitting at each node during oblique PPtree construction, producing individual classifiers whose ensemble is robust and more diverse than the classical Random Forest.

Tables 5 and 6 depict the performance of the proposed churner classification framework using the ensemble PPForest with the two techniques, LDA and SVM. To show that PPForest performs meaningful feature selection, the churn prediction model is evaluated both on the datasets after applying the feature selection strategy based on the Chi-Square statistical test and on the full feature set of the telecom datasets.
Based on the performance of the proposed PPForest method in this research and on the studies of other researchers, LDA with a linear projection pursuit index can be used as the classifier of churning customers in telecom datasets. It produced better outcomes than SVM on most of the performance measures in terms of class discrimination on telecom datasets where the discriminatory information is not aligned with the direction of maximum variance. The accuracy values show that LDA has the best performance in some telecom datasets. Moreover, the better AUC, ROC and KS results show that the prediction model can retain more hidden churners at lower cost across Telecom sectors and diverse churn rates. These outcomes are attributed to the proposed LDA and SVM models, which use an appropriate kernel function and constant parameter values so that the churn prediction model rests on structural risk minimization, which covers both empirical risk and confidence minimization.

CONCLUSION
PPForest is a new methodology for constructing decision trees in which LDA or SVM is employed to separate the dataset features. Using PPForest as a forest procedure reduces the decision tree size without sacrificing prediction accuracy; this is done by taking full advantage of the visual power of multi-dimensional graphical displays and the predictive power of LDA and SVM. One of the main advantages of an oblique decision tree is that it uses the association between variables to find class separations, and it offers a visual illustration of the differences between classes in feature space that can be used to understand model outcomes. PPForest uses the correlation between predictor variables to find the best separation between classes. It has been shown that PPForest achieves better predictive performance than a random forest when the correlation between predictors is large. Projection pursuit solves a problem of the original random forest algorithm, where oblique projections were an option but effectively useless because arbitrary projections were used. The space of projections is very large, so the random forest can rarely find good oblique projections. Additionally, the trees produced by PPForest are tested by a weak tree remover procedure so that weak trees are ignored; this step improved the accuracy of PPForest, and the subsequent one-dimensional projections of the data give convenient visualizations of the group separations, especially for multiclass classification problems. Running the proposed system illustrates that the decision PPtree is not necessarily easy to understand. Ease of interpretation diminishes rapidly with increasing tree size, and tree size naturally grows with the number of features. The experimental results have shown that PPForest using LDA with the weak tree remover is better than PPForest using SVM in many Telecom datasets based on the evaluation measures.

to use $k(x_i, x_j)$ such that its discretization $K_{ij} = k(x_i, x_j)$ is a positive definite matrix. The decision prediction can then be represented as in Equation 12:

$$f(x) = \operatorname{sgn}\left(\sum_{i=0}^{n} \alpha_i y_i \, k(x_i, x) + b\right) \qquad (12)$$

How often does the positive condition occur in the sample.

ISSN: 2088-8708, Int J Elec & Comp Eng, Vol. 10, No. 2, April 2020: 1406-1421

categorized as churn/non-churn. Prediction models with a low Recall misclassify a large number of the positive cases [44].

Figure 3. Huber's plot and the visualization of one tree of Larose telecom data

Table 1. The literature review of recent research related to the suggested CCP model based on the ensemble PPForest algorithm

Table 1. The literature review (continued)

Table 2. Summary of some of the research datasets

1. State the hypotheses: The statistical test for independence can be applied to categorical variables. The null hypothesis states that the variables are independent. The alternative hypothesis states that the variables are dependent.
2. Formulate analysis plan: The analysis strategy describes how to use samples of data to accept or reject the null hypothesis.
3. Analyze sample data: Using the data sample, find the degrees of freedom, the expected frequencies, the test statistic, and the P-value associated with the test statistic.
a. The degrees of freedom (DF) are computed as in Equation 1.

Table 3. Telecom datasets after applying PPM and chi-square

misclassifying a category zero object as category one (FP), and 1 − c the cost of misclassifying a category one object as category zero (FN), based on the calculated confusion matrix. It is implicitly assumed that the two misclassification costs sum to 1. Minimal Error Rate (MER), Minimum Weighted Loss (MWL).

Table 4. Parameter estimation for best SVM performance