Credal Fusion of Classifications for Noisy and Uncertain Data

This paper reports on an investigation of classification techniques for noisy and uncertain data. Classification is not an easy task, and discovering knowledge from uncertain data is a significant challenge. Several problems arise. Most of the time, no good or large learning database is available for supervised classification. Moreover, when training data contain noise or missing values, classification accuracy is dramatically affected. Extracting groups from data is also difficult: groups overlap and are not well separated from each other. Another problem is the uncertainty due to measuring devices. Consequently, the classification model is not robust enough to classify new objects. In this work, we present a novel classification algorithm that addresses these problems. Our main idea is to use belief function theory to combine classification and clustering. This theory handles well the imprecision and uncertainty linked to classification. Experimental results show that our approach can significantly improve the quality of classification on generic databases.


I. Introduction
There are two broad types of classification technique: supervised and unsupervised. Supervised classification is the essential tool for extracting quantitative information from a learning database. All extracted features are assigned to labeled examples. It classifies objects by measuring the similarity between new objects and the learning database. The second technique forms clusters by measuring essentially two criteria, compactness and separation [3,2]. It tries to form clusters that are as compact and as separable as possible. Grouping data is not straightforward. First, clusters overlap most of the time. Second, the data to classify are generally very complex. Moreover, there is no unique quality criterion to measure the goodness of a classification.
Generally, a validity index is used to measure the quality of clustering. So far, no universal standard index exists; it varies from one application to another. The data to classify are not always correct, especially in real applications. They can be uncertain or ambiguous, since they depend on acquisition devices or expert opinions. Consequently, the result of classification will be uncertain. Besides, labeled examples used for training are sometimes not available. Because of these limits, and with the objective of improving the classification process, we propose to combine classification and clustering. This combination, also called a fusion procedure, aims to take into account the complementarity between both. Clustering is used to overcome problems of learning and over-fitting. The combination is made using belief function theory, which is well known for treating problems of uncertainty and imprecision.
In this paper, we report our recent research efforts toward this goal. First, we present the basic concepts of belief function theory. Then we propose a novel classification mechanism based on combination. The new process aims to improve classification results in noisy environments and with missing data. We conduct experiments on generic data to show the quality of the data mining results. The rest of this paper is organized as follows. Section II presents belief function theory. Related work is discussed in Section III. In Section IV, we describe the details of the proposed fusion mechanism. Experimental results and discussion are presented in Section V. In Section VI, we conclude the paper and outline future work.

II. Belief function theory
Fusion is the process of combining multiple data or pieces of information coming from different sources in order to make a decision. The final decision is better than the individual ones: the variety of information involved in the combination brings the added value. Combination is needed in problems where ambiguity and uncertainty are high and where we may be unable to make an individual decision. To remove the ambiguity, we must fuse. Many applications require fusion. One example is medicine [23]: it is sometimes difficult for a single doctor to establish a good diagnosis, and it is better to fuse the opinions of several doctors. Tumor detection is a well-known application. Other applications include image processing [25,21], classification [22,24], remote sensing and artificial intelligence. Several means of combination exist, known as uncertainty theories: voting theory, possibility theory, probability theory and belief function theory. The latter is robust to uncertainty and imprecision. The theory was introduced by Dempster in 1967 and taken up by Shafer; it is also called Dempster-Shafer theory [1] [19].
Belief function theory models the belief in an event by a function called a mass function. We denote by $m_j$ the mass function of the source $S_j$. It is defined on the set $2^\Theta$, takes its values in $[0, 1]$ and verifies the constraint:

$$\sum_{A \in 2^\Theta} m_j(A) = 1$$

$2^\Theta$ is the set of disjunctions of decisions, or of classes $C_i$ in the case of classification:

$$2^\Theta = \{\emptyset, \{C_1\}, \{C_2\}, \{C_1 \cup C_2\}, \ldots, \Theta\}$$

The subsets $A$ of $\Theta$ such that $m(A) > 0$ are called focal elements. The set of focal elements is called the kernel. $m(A)$ is a measure of the evidence allocated exactly to the hypothesis $X \in A$. The classes $C_i$ must be exclusive but not necessarily exhaustive.
Belief function theory measures imprecision and uncertainty through several functions such as credibility and plausibility. Credibility is the minimum belief. It takes into account the conflict between sources. Credibility is defined by:

$$bel(X) = \sum_{Y \subseteq X,\ Y \neq \emptyset} m(Y)$$

The plausibility function measures the maximal belief in $X \in 2^\Theta$. We suppose that all decisions are complete, so we are in a closed frame of discernment:

$$pl(X) = \sum_{Y \in 2^\Theta,\ Y \cap X \neq \emptyset} m(Y) = 1 - bel(X^c)$$

where $X^c$ is the complement of $X$.
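As an illustration of the credibility and plausibility functions described above, the following minimal sketch represents a mass function as a Python dictionary from focal elements (frozensets of class names) to masses. The frame, the class names and the mass values are hypothetical and chosen only for the example.

```python
def bel(m, x):
    """Credibility of x: sum of the masses of non-empty focal elements included in x."""
    return sum(mass for a, mass in m.items() if a and a <= x)


def pl(m, x):
    """Plausibility of x: sum of the masses of focal elements that intersect x."""
    return sum(mass for a, mass in m.items() if a & x)


# Hypothetical frame of discernment with three classes and a toy mass function.
THETA = frozenset({"C1", "C2", "C3"})
m = {
    frozenset({"C1"}): 0.5,
    frozenset({"C1", "C2"}): 0.3,
    THETA: 0.2,  # mass left on the whole frame represents ignorance
}

x = frozenset({"C1"})
print("bel:", bel(m, x), "pl:", pl(m, x))
# In a closed frame, pl(X) equals 1 - bel(complement of X).
print(abs(pl(m, x) - (1.0 - bel(m, THETA - x))) < 1e-12)
```

The interval between `bel` and `pl` for a class gives a simple view of how much of the belief is still uncommitted.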
To represent a problem with the concepts of belief function theory, three steps must be followed: modelling, combination and decision. There is also an intermediate step, discounting, which can be done before or after combination. It measures the reliability of the sources through a reliability coefficient denoted $\alpha$. The first step is the most crucial: we must choose a suitable model to represent the mass functions. The choice depends on the context and the application. Masses can be computed in many ways, essentially with probabilistic models and distance models.
For the second step, combination can be done using different operators. The choice of a suitable operator depends on the context. Several hypotheses govern the operator, such as the independence and reliability of the sources. Available operators include the conjunctive, disjunctive and cautious rules [14]. The conjunctive rule assumes that the sources are independent and reliable, whereas the disjunctive rule assumes that at least one of them is reliable. The cautious rule does not impose the independence hypothesis on the sources, so it allows dependence and redundancy. This situation may be encountered in practice: for example, experts may share some information, and classifiers may be trained on the same or on overlapping learning sets.
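The conjunctive operator mentioned above can be sketched as follows for two sources. This is a minimal, unnormalized version in which the mass assigned to the empty set measures the conflict between the sources; the two mass functions are hypothetical values chosen for the example.

```python
def conjunctive(m1, m2):
    """Unnormalized conjunctive combination of two mass functions.

    Each mass function maps frozenset focal elements to masses. The product
    of two masses is transferred to the intersection of their focal elements;
    the mass that ends up on the empty set is the conflict.
    """
    combined = {}
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            inter = a & b
            combined[inter] = combined.get(inter, 0.0) + mass_a * mass_b
    return combined


# Hypothetical two-class frame and two partially conflicting sources.
THETA = frozenset({"C1", "C2"})
m1 = {frozenset({"C1"}): 0.6, THETA: 0.4}
m2 = {frozenset({"C2"}): 0.5, THETA: 0.5}

m12 = conjunctive(m1, m2)
for focal, mass in sorted(m12.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(focal), round(mass, 3))
```

Because the rule multiplies masses, it implicitly treats the two sources as independent, which is the hypothesis the cautious rule relaxes.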
The conjunctive combination fuses the sources by considering the intersections between the elements of $2^\Theta$. It reduces the imprecision of focal elements and increases the belief in the elements on which the sources agree. For $M$ mass functions to combine, we have the following formula:

$$m(A) = \sum_{B_1 \cap \cdots \cap B_M = A}\ \prod_{j=1}^{M} m_j(B_j), \quad \forall A \in 2^\Theta$$

The cautious rule combines two mass functions $m_1$ and $m_2$ into $m_{12} = m_1 \wedge m_2$. $m_{12}$ is the information gained from the two sources $S_1$ and $S_2$. It must be more informative than $m_1$ and $m_2$. Formally, $m_{12} \in S(m_1) \cap S(m_2)$, where $S(m)$ is the set of mass functions more informative than $m$. To choose among these mass functions, we apply the least commitment principle (LCP): if several mass functions are compatible with some constraints, the least informative one in $S(m_1) \cap S(m_2)$ should be selected. This element is unique. It is the non-dogmatic mass function ($m_{12}(\Theta) > 0$) with the following weight function:

$$w_{12}(A) = w_1(A) \wedge w_2(A), \quad \forall A \subset \Theta$$

where $\wedge$ denotes the minimum. The weight function $w$ represents a non-dogmatic mass function as a combination of simple mass functions. It may be computed from $m$ as follows:

$$w(A) = \prod_{B \supseteq A} q(B)^{(-1)^{|B| - |A| + 1}}, \quad \forall A \subset \Theta$$

where $q$ is the commonality function, defined as:

$$q(A) = \sum_{B \supseteq A} m(B)$$

To apply this principle, some informational ordering between mass functions has to be chosen. Several orderings can be used, such as the q-ordering and the w-ordering. In the first, $m_1$ is q-more committed (more informative) than $m_2$, denoted $m_1 \sqsubseteq_q m_2$, if $q_1(A) \le q_2(A)$ for all $A \subseteq \Theta$. The second is based on the conjunctive weight function: $m_1$ is w-more committed than $m_2$, denoted $m_1 \sqsubseteq_w m_2$, if $w_1(A) \le w_2(A)$ for all $A \subset \Theta$.
After computing and combining the mass functions, we obtain the masses of the different elements of the frame of discernment. We must then take a decision, that is, assign a class if the goal is classification. The decision is made using a criterion. Several criteria exist: maximum of plausibility, maximum of credibility and pignistic probability. With the first criterion, we choose the singleton or class $C_i$ that gives the maximum plausibility. For an object or vector $x$, we decide $C_i$ if:

$$pl(C_i)(x) = \max_{k}\ pl(C_k)(x)$$

This criterion is optimistic because the plausibility of a singleton measures the belief obtained if all the masses of the disjunctions containing it were focused on it. The second criterion chooses $C_i$ for $x$ if it gives the maximum credibility:

$$bel(C_i)(x) = \max_{k}\ bel(C_k)(x)$$

This criterion is more selective because the credibility function gives the minimum belief committed to a decision. The third criterion is a compromise between the two others: it lies between credibility and plausibility. For a class $C_i$, the pignistic probability is defined as:

$$betP(C_i) = \sum_{A \in 2^\Theta,\ C_i \in A} \frac{m(A)}{|A|\,(1 - m(\emptyset))}$$

where $|A|$ is the cardinality of $A$. The maximum of pignistic probability decides $C_i$ for an observation $x$ if:

$$betP(C_i)(x) = \max_{k}\ betP(C_k)(x)$$

This criterion is better suited to a probabilistic context. In the next section, we present some works related to classification combination.
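The cautious rule described in this section can be sketched in a few lines. The sketch below is a minimal illustration, not a reference implementation: it assumes non-dogmatic mass functions (positive mass on the whole frame), computes the conjunctive weights from the commonality function, keeps the minimum weight for each subset, and rebuilds the combined mass function by conjunctively combining the corresponding simple mass functions. The helper names and the example masses are hypothetical.

```python
from itertools import combinations


def subsets(frame):
    """All subsets of the frame (including the empty set), as frozensets."""
    items = sorted(frame)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def commonality(m, frame):
    """q(A): sum of m(B) over all focal elements B that contain A."""
    return {a: sum(v for b, v in m.items() if a <= b) for a in subsets(frame)}


def weights(m, frame):
    """Conjunctive weights w(A) for every A strictly included in the frame.

    Assumes m is non-dogmatic (m[frame] > 0), so every q(B) is positive.
    """
    q = commonality(m, frame)
    w = {}
    for a in subsets(frame):
        if a == frame:
            continue
        value = 1.0
        for b in subsets(frame):
            if a <= b:
                value *= q[b] ** ((-1) ** (len(b) - len(a) + 1))
        w[a] = value
    return w


def from_weights(w, frame):
    """Conjunctively combine the simple mass functions (A: 1 - w, frame: w)."""
    m = {frame: 1.0}
    for a, wa in w.items():
        simple = {a: 1.0 - wa, frame: wa}
        combined = {}
        for x, vx in m.items():
            for y, vy in simple.items():
                combined[x & y] = combined.get(x & y, 0.0) + vx * vy
        m = combined
    return m


def cautious(m1, m2, frame):
    """Cautious rule: keep, for each subset, the minimum of the two weights."""
    w1, w2 = weights(m1, frame), weights(m2, frame)
    return from_weights({a: min(w1[a], w2[a]) for a in w1}, frame)


# Two hypothetical sources supporting the same class with different strengths.
THETA = frozenset({"C1", "C2"})
m1 = {frozenset({"C1"}): 0.4, THETA: 0.6}
m2 = {frozenset({"C1"}): 0.7, THETA: 0.3}
m12 = cautious(m1, m2, THETA)
print({tuple(sorted(k)): round(v, 6) for k, v in m12.items()})
```

In this example the cautious rule keeps the stronger of the two supports rather than reinforcing them, whereas the conjunctive rule would count the shared evidence twice. Combining a mass function with itself returns the same mass function, which is why the rule tolerates redundant sources.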

III. Related works
Much research has been done on fusion in classification. Most of it concerns either clustering [4,5,17,12,13] or classification [15,19]. Some studies use combination to improve classification performance. Others use fusion to construct a new classifier, such as a neural network based on belief functions [7], a credal K-NN [6] or a credal decision tree [8].
In [7], the study presents a solution to problems arising in the Bayesian model. The conditional densities and a priori probabilities of the classes are unknown. They can be estimated from learning samples, but the estimation is not reliable, especially if the learning database is small. Moreover, it cannot represent well the uncertainty about the class membership of new objects. If we have few labeled examples and must classify a new object that is very dissimilar from the others, the uncertainty will be high. This state of ignorance is not reflected by the outputs of a statistical classifier. This situation is met in many real applications, such as disease diagnosis in medicine. The study therefore tries to measure the uncertainty about the class of the new object, considering the information given by the learning data. To classify a new object, it focuses on its neighbors, which are considered as pieces of evidence or hypotheses about the class membership of the new object. Masses are assigned to each class for each neighbor of the object to classify. The beliefs are represented by basic belief assignments and combined by Dempster-Shafer theory to decide to which class the object belongs. The method does not depend strongly on the number of neighbors.
In [8], decision trees (each classifying two classes) are combined to solve multi-class problems using belief function theory. Classic decision trees are based on probabilities. They are not always suitable for some problems, such as those involving uncertainty: the uncertainty of inputs and outputs cannot be modelled well by probability. Moreover, a good learning database is not always available. The research extends a previous study on belief-function-based decision trees for two-class problems. The new study treats the multi-class problem by combining two-class decision trees using evidence theory.
In [11], two supervised classifiers are combined: Support Vector Machines (SVM) and K-Nearest Neighbors (K-NN). The combination aims to improve classification performance. Each classifier has disadvantages. SVM, for example, depends strongly on the learning samples and is sensitive to noise and outliers. K-NN is a statistical classifier that is also sensitive to noise. A new hybrid algorithm is proposed to overcome the limits of both classifiers.
Concerning the combination of clusterings, much research has been done. In [4], a novel classifier is proposed based on a collaboration between several clustering techniques. The collaboration takes place in three stages: parallel and independent clusterings, refinement and evaluation of the results, and unification. The second stage is the most difficult. A correspondence between the clusters obtained by the different classifiers is sought, and conflicts between results may be found. Conflicts are resolved iteratively in order to obtain a similar number of clusters. The possible actions to solve conflicts are the fusion, deletion and splitting of clusters. The results are then unified by a voting technique. The combination was used to analyze multi-source images; fusion was needed because the sources are heterogeneous.
In [18], several techniques of clustering collaboration are presented. They differ in the type of result, which can be a unique partition of the data or an ensemble of clustering results. For the first type of result, fusion techniques of classification are used. For the second, multi-objective clustering methods are used, which try to optimize several criteria simultaneously. At the end of the process, a set of results is obtained, from which the result offering the best compromise between the criteria to be optimized is retained. Concerning fusion between clustering and classification, many studies use clustering in the learning phase of supervised classification [10,16,20].

IV. Fusion mechanism
This work is an improvement of a previous one. The former [9] was established to combine clustering and classification in order to improve their performance. Both have difficulties. For clustering, the main problems are complex data and the choice of a validity index. For classification, the problem is the lack of a learning database. We used belief function theory to fuse them, following the three steps of the combination process: modelling, combination and decision. Our frame of discernment is $2^\Theta$, with $\Theta = \{q_j;\ j = 1, \ldots, n\}$, where $n$ is the number of classes $q_j$ found by the supervised classifier.
For the modelling step, both sources must give their beliefs in the classes. The unsupervised source outputs clusters, and the classes are unknown to it. How can the clustering source give its beliefs in the classes? To do so, we look for the similarity between classes and clusters: the greater the similarity, the more the two classifications agree with each other. Similarity is generally measured by a distance, but measuring the distance between a cluster and a class raises the difficult problem of choosing the best distance. We chose instead to look at the overlap between clusters and classes: the more objects they have in common, the more similar they are. For the supervised source, we used the probabilistic model of Appriou. Only singletons are of interest to us. In the combination phase, we adopted the conjunctive rule, which works on the intersections of the elements of the frame of discernment. Finally, we must decide to which class each object belongs. We decide following the pignistic criterion, which is a compromise between credibility and plausibility. The process is summarized as follows.
Step 1: Modelling. Masses are computed for both sources, supervised and unsupervised:
• Clustering (unsupervised source): We look for the proportions of the classes $q_1, \ldots, q_n$ found by the supervised classifier in each cluster [5,4]. For all $x \in C_i$, $i = 1, \ldots, c$, with $c$ the number of clusters found, the mass function for an object $x$ to be in the class $q_j$ is:

$$m(q_j) = \frac{|C_i \cap q_j|}{|C_i|}$$

where $|C_i|$ is the number of elements in the cluster $C_i$ and $|C_i \cap q_j|$ is the number of elements in the intersection between $C_i$ and $q_j$. Then we discount the mass functions, $\forall A \in 2^\Theta$, by:

$$m^{\alpha_i}(A) = \alpha_i\, m(A) \ \text{ for } A \neq \Theta, \qquad m^{\alpha_i}(\Theta) = 1 - \alpha_i\,(1 - m(\Theta))$$

The discounting coefficient $\alpha_i$ depends on the object: we cannot discount all objects in the same way. An object situated at the center of a cluster is considered more representative of the cluster than one situated at its border, for example. The coefficient $\alpha_i$ is therefore defined from the position of $x$ relative to $v_i$, the center of cluster $C_i$.
• Classification (supervised source): We used the probabilistic model of Appriou, in which $q_i$ is the real class and $\alpha_{ij}$ is the reliability coefficient of the supervised classification concerning class $q_j$. The conditional probabilities are computed from confusion matrices on the learning database.
Step 2: Combination. Use of the conjunctive rule (Equation 4).
Step 3: Decision. Use of the pignistic criterion (Equation 15).
The present paper aims at three improvements: robustness to noise, to missing (uncertain) data and to the lack of a learning database. In the previous work, we supposed that the data were correct. We therefore introduce some modifications to the previous mechanism. To compute the masses for the supervised source, we keep Appriou's model (Equations 20, 21 and 22). For the unsupervised source, we follow the steps below.
Step 1: For each cluster $C_i$, we combine the supervised masses of the objects belonging to it by the conjunctive rule. This gives an idea of the proportion of labels present in a cluster: which class is the majority and which are minorities.
Step 2: We obtain $c$ masses for each element $A \in 2^\Theta$, with $c$ the number of clusters obtained. We combine them by the conjunctive rule. This shows how far the two classifications agree with each other: the closer the masses are to 1, the less they are in conflict. Before combining, we discount the masses using a reliability coefficient denoted $degnet_{ik}$.
We obtain the belief in the elements of the frame of discernment. $degnet_{ik}$ is a measure of the neatness of object $x_k$ relative to cluster $C_i$. An object $x_k$ may be clear or ambiguous for a given cluster. If it is at or near the center of a cluster, it is considered a very good object: it represents the cluster very well, and we can affirm that it belongs to only one cluster. If it is situated on the border between two or more clusters, it cannot be considered a clear object for a single cluster. It is ambiguous and may belong to more than one group. The computation of $degnet_{ik}$ takes into account two factors: the degree of membership to cluster $C_i$ and the maximal degree of overlapping in the current partition, denoted $S_{max}$, which is the maximal similarity in the partition found by the clustering.
$degoverl_i$ is the overlapping degree with respect to cluster $C_i$. The degree of neatness is the complement to 1 of the degree of overlapping. The degree of overlapping is composed of two terms. The first one, $(1 - \mu_{ik})$, measures the degree of non-membership of a point $x_k$ to a cluster $C_i$. The second one takes into account the overlapping aspect through $S_{max}$, which measures the maximal overlapping in the partition:

$$S_{max} = \max_{i \neq j} S(C_i, C_j)$$

The clusters $C_i$ and $C_j$ are considered as fuzzy sets, not hard ones.
The similarity measure $S(C_i, C_j)$ is not based on a distance measure because of its limits: two pairs of clusters may be separated by the same distance without being separable in the same way. It is instead based on membership degrees. We look for the degree of co-relation between two groups, that is, the minimum level of co-relation guaranteed. The new measure satisfies the following properties:
Property 1: $S(C_i, C_j)$ is the maximum degree between two clusters.
Property 2: The similarity degree is bounded: $0 \le S(C_i, C_j) \le 1$.
Property 3: $S(C_i, C_j) = 0.4$ means that the two clusters are similar, or in relation, with a minimum degree of 0.4, and that they are not connected with a degree of 0.6.
The degree of membership $\mu_{ik}$ of an object $x_k$ to a cluster $C_i$ is calculated from $v_i$, the center of cluster $C_i$, and $n_1$, the number of objects.
For the combination phase, we use the cautious rule (Equation 5). The sources are not totally independent, because the computation of the masses for the unsupervised source is based on the classes given by the supervised source. Finally, we decide using the pignistic probability. We are interested only in singletons, that is, the labels given by the classification. The fusion process is summarized in the following figure:
[Figure: summary of the proposed fusion process]
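To make the three steps of the earlier mechanism [9] recalled at the start of this section more concrete, the sketch below runs them on toy data: cluster masses from class proportions, discounting, conjunctive combination and a pignistic decision. It is only an illustration under stated assumptions: the reliability value `alpha = 0.8` and the supervised mass function are hypothetical stand-ins (Appriou's model and the object-dependent coefficient are not reproduced here), and the function names are invented for the example.

```python
from collections import Counter


def cluster_masses(cluster_ids, class_labels):
    """Per cluster C_i, mass on each singleton {q_j}: |C_i ∩ q_j| / |C_i|."""
    sizes = Counter(cluster_ids)
    overlap = Counter(zip(cluster_ids, class_labels))
    masses = {c: {} for c in sizes}
    for (c, q), count in overlap.items():
        masses[c][frozenset({q})] = count / sizes[c]
    return masses


def discount(m, alpha, frame):
    """Discounting: keep a fraction alpha of each mass, move the rest to the frame."""
    out = {a: alpha * v for a, v in m.items() if a != frame}
    out[frame] = 1.0 - alpha * (1.0 - m.get(frame, 0.0))
    return out


def conjunctive(m1, m2):
    """Unnormalized conjunctive rule; the mass on frozenset() is the conflict."""
    out = {}
    for a, va in m1.items():
        for b, vb in m2.items():
            out[a & b] = out.get(a & b, 0.0) + va * vb
    return out


def pignistic(m, frame):
    """Pignistic probability of each singleton, renormalized by 1 - m(empty set)."""
    conflict = m.get(frozenset(), 0.0)
    return {q: sum(v / len(a) for a, v in m.items() if q in a) / (1.0 - conflict)
            for q in frame}


# Toy data: cluster ids from a clustering, labels predicted by a supervised classifier.
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
predicted = ["q1", "q1", "q1", "q2", "q2", "q2", "q2", "q1"]
THETA = frozenset({"q1", "q2"})

m_clusters = cluster_masses(cluster_ids, predicted)
# Hypothetical reliability alpha = 0.8 for one object of cluster 0.
m_unsup = discount(m_clusters[0], 0.8, THETA)
# Hypothetical supervised mass for the same object (stand-in for Appriou's model).
m_sup = {frozenset({"q1"}): 0.7, THETA: 0.3}

bet = pignistic(conjunctive(m_unsup, m_sup), THETA)
decision = max(bet, key=bet.get)
print(decision, {q: round(p, 3) for q, p in sorted(bet.items())})
```

In the mechanism proposed in this paper, the last combination would use the cautious rule instead of the conjunctive one, since the unsupervised masses are derived from the supervised labels.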

V. Experimental study
In this section, we present the results obtained with our fusion approach between supervised and unsupervised classification. We conduct our experimental study on several generic databases obtained from the U.C.I. repository of Machine Learning databases. We run experiments on the original data, on data in which we make values missing, and on data into which we inject noise at different rates, using a small learning sample (10%). The aim is to demonstrate the performance of the proposed method and the influence of the fusion on the classification results in a noisy environment and with missing data. The experiments are based on three unsupervised methods: Fuzzy C-Means (FCM), K-Means and the Mixture Model. For the supervised methods, we use K-Nearest Neighbors, credal K-Nearest Neighbors, Bayes, decision tree, neural network, SVM and credal neural network.
Table 2 shows the classification rates obtained before and after fusion with the new mechanism. The data sets shown are Iris, Abalone, Breast-cancer, car, wine, sensor-readings24 and cmc. The rates before fusion are those obtained with the supervised methods alone (K-Nearest Neighbors, credal K-Nearest Neighbors, Bayes, decision tree, neural network, SVM and credal neural network). The learning sample represents 10% of the data.
Table 3 shows the classification rates obtained before and after fusion with the new mechanism for missing data. The data sets shown are Iris, Abalone, wine, sensor-readings24 and cmc. The rates before fusion are those obtained with the supervised methods alone (K-Nearest Neighbors, credal K-Nearest Neighbors, Bayes, decision tree, SVM and credal neural network). The learning sample represents 10% of the data.
Tables 4, 5, 6, 7, 8 and 9 show the classification rates obtained before and after fusion with the new mechanism in a very noisy environment. We vary the noise level and show the results obtained with levels of 55%, 65% and 70% for Iris, Abalone, Yeast, wine, sensor-readings4 and sensor-readings2, respectively.

A. Experimentation
The number of clusters may be equal to the number of classes given by the supervised classification or fixed by the user. The tests conducted for the three noise levels are independent: they were not made in the same run of the program. In the following, we present the data (Table 1) and the results obtained (Tables 2, 3, 4, 5, 6, 7, 8 and 9).

B. Discussion
Looking at the results shown in Table 2, we observe the following for each data set.

Iris
The performance obtained after fusion is equal to 100%, except for the decision tree and the neural network, which show no improvement: their classification rate is approximately 66%.

Abalone
The performance obtained after fusion is better than that obtained before fusion, except for the decision tree, which shows no improvement: its classification rate is 31.28%. The best result, 97.58%, is obtained for KNN with the mixture model.

Breast-cancer
The performance obtained after fusion is equal to 100% (KNN, Bayes, decision tree, neural network, credal KNN), except for SVM and credal neural network, for which the classification rate is approximately 65%.

car
The classification rate after fusion is better in most cases: 100% (KNN and credal KNN), 96% (Bayes) and 92% (decision tree). For SVM, neural network and credal neural network, the performance is lower than before fusion (70%).

sensor-readings24
The classification rates obtained after fusion are equal to 100% (KNN, Bayes, decision tree, credal KNN, SVM) and to 99% for the neural network.
For the results obtained with missing data (Table 3), we observe the following.

Iris
The classification rates after fusion are excellent: 100% (KNN, Bayes, neural network, SVM, credal KNN, credal neural network) and 64% for the decision tree.

wine
The best classification rate was obtained for Bayes (100%). For credal KNN, decision tree and credal neural network, no improvement was observed.
• Noise level equal to 70%: A slight improvement was observed for KNN and credal KNN (36%) and for credal neural network (31%).

Yeast
• Noise level equal to 55%: Only the combinations with KNN improved the classification performance, reaching 61% against 26% before fusion.
• Noise level equal to 65%: No improvement was observed for the combinations with credal neural network, whereas there was one for KNN and credal KNN, with a good result of 76%.
• Noise level equal to 70%: All combinations have improved performance. The three tests are independent.
• Noise level equal to 65%: We reach a performance rate of 100% for KNN and credal neural network and of 85% for credal KNN.
• Noise level equal to 70%: KNN improved to a perfect performance of 100%. Credal KNN reaches 78%. No improvement was observed for credal neural network.

sensor-readings2
• Noise level equal to 55%: We have a perfect performance of 100% for all combinations.
• Noise level equal to 65%: Same remark as before (100% for all combinations).
• Noise level equal to 70%: All combinations have improved performance.

VI. Conclusion
This paper proposes a new approach that improves a previous work [9]. The goal is to construct a fusion mechanism that is robust to noise, to the lack of sampling data and to missing data. The former work fused both types of classification in order to overcome their limits and problems. It improved classification performance and was based on belief function theory. The new mechanism is based on the same theory. Modifications were made in the modelling and combination phases. We changed the way the masses are computed for the unsupervised source, and we replaced the conjunctive rule by the cautious rule because the two sources are not independent: the computation of the masses for the clustering is based on the classes produced by the supervised classifier. We conducted our experiments in three steps. The first was done without noise, the second with missing data and the last with different levels of noise. We reported high noise levels (55%, 65% and 70%) with a small sampling rate of 10%. The three tests are independent. The results obtained are good: in most cases, the combinations give the best results. This work can be extended by studying the dependence issue in more depth.

Table 1. Data characteristics. NbA: number of attributes, NbC: number of classes, NbCl: number of clusters tested

Table 2. Classification rates obtained before and after fusion

Table 3. Classification rates obtained before and after fusion for missing data

Table 4. Classification rates obtained for Iris with K-NN, credal K-NN and Credal Neural Network with FCM, K-Means and Mixture Model before and after fusion with different noise rates (55%, 65%, 70%)

Table 5. Classification rates obtained for Abalone with K-NN, credal K-NN and Credal Neural Network with FCM, K-Means and Mixture Model before and after fusion with different noise rates (55%, 65%, 70%)

Table 6. Classification rates obtained for Yeast with K-NN, credal K-NN and Credal Neural Network with FCM, K-Means and Mixture Model before and after fusion with different noise rates (55%, 65%, 70%)

Table 7. Classification rates obtained for Wine with K-NN, credal K-NN and Credal Neural Network with FCM, K-Means and Mixture Model before and after fusion with different noise rates (55%, 65%, 70%)

Table 8. Classification rates obtained for sensor-readings4 with K-NN, credal K-NN and Credal Neural Network with FCM, K-Means and Mixture Model before and after fusion with different noise rates (55%, 65%, 70%)

Table 9. Classification rates obtained for sensor-readings2 with K-NN, credal K-NN and Credal Neural Network with FCM, K-Means and Mixture Model before and after fusion with different noise rates (55%, 65%, 70%)
K-NN + K-Means: 100.00, 100.00, 59.65
K-NN + Mixture Model: 100.00, 100.00, 59.65