Speaker specific feature based clustering and its applications in language independent forensic speaker recognition

Received Dec 5, 2018; Revised Dec 18, 2019; Accepted Jan 11, 2020

Forensic speaker recognition (FSR) is the process of determining whether the source of a questioned voice recording (trace) is a specific individual (suspected speaker). Most existing methods measure inter-utterance similarities directly from spectrum-based characteristics; the resulting clusters may therefore relate not to speakers but to different acoustic classes. This research addresses this deficiency by projecting language-independent utterances into a reference space equipped to cover the standard voice features underlying the entire utterance set. A clustering approach based on peak approximation is then proposed in order to maximize the similarities between language-independent utterances within all clusters. This method uses the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva algorithms to evaluate the cluster to which each utterance should be allocated, overcoming the disadvantage of traditional hierarchical clustering, whose final outcome does not necessarily reach the optimum recognition efficiency. The recognition efficiencies of the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms are 95.2%, 97.3%, 98.5% and 99.7%, and the EERs are 3.62%, 2.91%, 2.82% and 2.61%, respectively. The EER improvement of the Gath-Geva based FSR system compared with Gustafson-Kessel and Fuzzy C-means is 8.04% and 11.49%, respectively.


INTRODUCTION
Speaker recognition is the general term covering the many different tasks of discriminating between one person and another based on the sound of their voices [1]. Forensics means the use of science or technology to investigate and establish facts or evidence in a court of law. The role of forensic science is to provide information (fact or opinion) to assist investigators and courts in answering questions of importance. Forensic speaker recognition is the process of determining whether the source of a questioned voice recording (trace) is a particular person (suspected speaker). This process involves comparing an unidentified voice recording (questioned recording) with one or more recordings of a known voice (the suspected speaker's voice) [1]. Forensic Automatic Speaker Recognition (FASR) is an established term for the adaptation of automatic speaker recognition methods to forensic applications. In automatic speaker recognition, deterministic or predictive models of the acoustic characteristics of the speaker's voice are compared with the acoustic characteristics of the recordings in question [1].
Speaker clustering refers to the task of grouping together unidentified speech utterances based on the voice characteristics of the speakers. The concerns and needs of the speaker recognition community have been a major motivation for research on speaker clustering for more than a decade [2, 3], the main purpose being to group together speech data produced by the same speaker or by speakers with similar voices so that the adaptation of acoustic models can be carried out more effectively. Because speaker clustering merely serves as a supplementary process in speech recognition, however, there is still a dearth of studies dedicated to this subject. More recently, speaker-clustering work has experienced a renaissance [4], driven by research into spoken document indexing to handle burgeoning collections of accessible voice data. The promise of this emerging research topic is that the human effort needed for documentation can be dramatically reduced by grouping speech data from the same speakers. Speaker clustering can be described as an unsupervised speaker-recognition problem: the speaker recognition process [5] is concerned with determining the identity of a speaker or whether a speaker is who he/she claims to be. Contrary to the traditional speaker-recognition approach, however, which assumes that some contextual information or speech details are accessible and can be modeled for the speakers concerned, speaker clustering must operate without any knowledge of who the potential speakers are or how many are involved in the speech to be clustered. Solutions to the speaker-clustering problem should therefore be able to extract and compare the voice characteristics underlying the utterance collections in an unsupervised manner.
Int J Elec & Comp Eng, ISSN: 2088-8708. Speaker specific feature based clustering and its applications in language independent forensic speaker recognition (Satyanand Singh), p. 3509
A related task is speaker segmentation [6], which seeks to identify the boundaries at which a speaker change occurs in an audio stream containing the speech of multiple people. Together with speaker clustering, speaker segmentation breaks a continuous input into discrete utterances that are easy to process in speech/speaker recognition and is therefore an essential step in the indexing of spoken documents. From another angle, speaker segmentation can be accomplished with the help of speaker clustering: a change of speaker may be declared between two adjacent short regions that receive different cluster indices.
Most current speaker-clustering methods follow a hierarchical clustering framework consisting of three main components: computing inter-utterance similarities, generating a cluster tree, and determining the number of clusters. The similarity measure is designed to produce higher values for similarities between utterances of the same speaker and lower values for similarities between utterances of different speakers. Many similarity measures, such as the Kullback-Leibler (KL) distance [6-8], the cross likelihood ratio (CLR) [8], and the generalized likelihood ratio (GLR), have been analyzed and compared in the literature. A cluster tree is created either in a bottom-up (agglomerative) or a top-down (divisive) fashion, according to some criterion derived from the similarity measure. The bottom-up method starts with each utterance as a single cluster and then merges the closest clusters successively until all the utterances fall into one cluster. Conversely, in the top-down method all the utterances start in a single cluster, and the clusters are split successively until each cluster contains exactly one utterance. The resulting cluster tree is then cut to obtain the best partition by estimating the number of clusters. Representative methods estimate the optimal number of clusters based on the BBN metric and the Bayesian Information Criterion [8, 9].
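As an illustration of the agglomerative procedure described above, the following sketch merges utterances bottom-up using a symmetric Kullback-Leibler distance between diagonal Gaussians fitted to each utterance's feature frames. The diagonal-Gaussian simplification and all function names are illustrative assumptions, not the exact measures used in the paper.

```python
import numpy as np

def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    kl12 = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl21 = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1.0)
    return kl12 + kl21

def agglomerate(utterances, n_clusters):
    """Bottom-up merging: each utterance (frames x dims array) starts as its
    own cluster; repeatedly merge the closest pair until n_clusters remain."""
    clusters = [[i] for i in range(len(utterances))]
    stats = [(u.mean(0), u.var(0) + 1e-6) for u in utterances]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sym_kl_diag(*stats[a], *stats[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # refit the Gaussian on the pooled frames of the merged cluster
        merged = np.vstack([utterances[i] for i in clusters[a] + clusters[b]])
        clusters[a] = clusters[a] + clusters[b]
        stats[a] = (merged.mean(0), merged.var(0) + 1e-6)
        del clusters[b], stats[b]
    return clusters
```

The pairwise search makes this quadratic in the number of clusters per merge, which is acceptable for the utterance counts typical of a single recording session.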
In addition to developing a more accurate measurement of inter-utterance similarity, we also investigate how the clusters can be optimally produced so that all the utterances within a cluster come from the same speaker. Conventional methods based on either top-down or bottom-up hierarchical clustering use a nearest-neighbor selection rule to decide which utterances should be assigned to the same cluster. However, the nearest-neighbor selection rule is applied in a cluster-by-cluster manner during the procedure of splitting one cluster or merging two clusters, rather than in a global manner that considers all the clusters jointly. Consequently, hierarchical clustering can only make each individual cluster as homogeneous as possible, but the ultimate goal of maximizing overall homogeneity cannot be achieved. To solve this problem, we propose a new clustering approach aimed specifically at maximizing the total number of within-cluster utterances from the same speakers. This is achieved by estimating the so-called cluster purity in combination with a genetic algorithm-based optimization process [10] to find the utterance partitioning that achieves maximum cluster purity.
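Cluster purity as used here can be computed as the fraction of utterances that belong to the majority speaker of their assigned cluster; a minimal sketch (the function name and label encoding are assumptions for illustration):

```python
from collections import Counter

def cluster_purity(assignments, true_speakers):
    """Overall purity: fraction of utterances belonging to the majority
    speaker of their assigned cluster. `assignments` holds cluster ids,
    `true_speakers` the reference speaker label per utterance."""
    clusters = {}
    for cid, spk in zip(assignments, true_speakers):
        clusters.setdefault(cid, []).append(spk)
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(assignments)
```

A genetic algorithm would then treat the assignment vector as the chromosome and this purity value as the fitness to be maximized.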

FUNDAMENTALS OF SPEAKER-SPECIFIC FEATURE-BASED CLUSTERING ANALYSIS
The goal of cluster analysis is to categorize objects based on the similarities between them and to organize data into groups. As unsupervised approaches, clustering techniques make no use of prior class identifiers. Clustering techniques can be classified according to their algorithmic approach: partitioning, hierarchical, and graph-theoretic methods can be distinguished, as well as methods based on an objective function.

Speaker specific training data
Clustering techniques can be applied to data that are quantitative (numerical), qualitative (categorical), or a mixture of both. In the language-independent FSR system, clustering of quantitative data is applied. Speaker-specific data are typically observations of physical processes. Each observation of the speaker consists of n measured variables grouped into an n-dimensional row vector x_k = [x_{k1}, x_{k2}, …, x_{kn}]^T, x_k ∈ R^n. A set of N observations is denoted by X and represented as an N×n matrix. In pattern-recognition terminology, the rows of X are called patterns or objects, the columns are called features or attributes, and X is called the pattern matrix. In this study, X is referred to as the data matrix. The meaning of the rows and columns of X with respect to reality depends on the context. For example, in an application the rows of X may represent speakers, and the columns speaker-specific measurements. When clustering is applied to dynamic system modeling and classification, the rows of X contain measurements of time signals, and the columns are, for example, speaker-specific variables observed in the system (sentiment, emotion, ethnicity, etc.). The purpose of clustering in the language-independent forensic application is to find relationships between language-independent system variables, called regressors, and the values of speaker-specific dependent variables, called regressands [11].

Clustering algorithms in FSR application
Various concepts of a cluster can be formulated, depending on the purpose of the clustering. In general, a cluster may be viewed as a group of objects that are more similar to one another than to members of other clusters. The term similarity should be understood as mathematical similarity in some well-defined sense. In metric spaces, similarity is often defined by means of a distance norm. The distance can be measured among the data vectors themselves, or between a data vector and some prototypical object of the cluster. The prototypes are usually not known beforehand and are sought by the clustering algorithms simultaneously with the partitioning of the data. The prototypes may be vectors of the same dimension as the data objects, but they can also be defined as "higher-level" geometrical objects, such as linear or nonlinear subspaces or functions.

Application of K-means and K-medoid algorithms
Hard partitioning methods are simple and popular, although their results are not always reliable and the algorithms also have numerical problems. From an N×n-dimensional data set, K-means and K-medoid algorithms allocate each data point to one of c clusters so as to minimize the within-cluster sum of squares:

J = Σ_{i=1}^{c} Σ_{x_k ∈ A_i} ‖x_k − v_i‖²  (2)

where A_i is the set of objects (data points) in the ith cluster and v_i is the mean of the points in cluster i; the norm in (2) denotes a distance measure. In K-means clustering the cluster prototypes are the cluster centers:

v_i = (1/N_i) Σ_{x_k ∈ A_i} x_k  (3)

where N_i is the number of objects in A_i. In K-medoid clustering the cluster centers V = {v_i ∈ X | 1 ≤ i ≤ c} are the objects of X nearest to the mean of the data in each cluster.
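A minimal K-medoid sketch following (2): points are assigned to their nearest medoid, and each medoid is re-selected as the cluster member minimizing the within-cluster sum of squared distances. The random initialization and fixed iteration count are assumptions for illustration.

```python
import numpy as np

def k_medoid(X, c, n_iter=20, seed=0):
    """K-medoid sketch: medoids are actual data points. Alternates
    nearest-medoid assignment with medoid re-selection."""
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), c, replace=False)]
    for _ in range(n_iter):
        # squared distances of every point to every medoid, shape (N, c)
        d = ((X[:, None, :] - medoids[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for i in range(c):
            members = X[labels == i]
            if len(members) == 0:
                continue
            # new medoid: member minimizing within-cluster sum of squares
            within = ((members[:, None, :] - members[None, :, :]) ** 2).sum(-1).sum(1)
            medoids[i] = members[within.argmin()]
    return labels, medoids
```

Because the prototypes are constrained to be data points, K-medoid is less sensitive to outliers than K-means at the cost of the extra pairwise-distance computation.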

Application of fuzzy C-means algorithm
The Fuzzy C-means clustering algorithm is based on the minimization of an objective function called the c-means functional, defined by Dunn as:

J(X; U, V) = Σ_{i=1}^{c} Σ_{k=1}^{N} (μ_ik)^m ‖x_k − v_i‖²_A  (4)

where V = [v_1, v_2, …, v_c], v_i ∈ R^n, is the vector of cluster prototypes (centers) to be determined, and D²_ikA = ‖x_k − v_i‖²_A = (x_k − v_i)^T A (x_k − v_i) is a squared inner-product distance norm. Statistically, (4) can be seen as a measure of the total variance of x_k from v_i. The minimization of the c-means functional (4) is a nonlinear optimization problem that can be solved by a variety of available methods, ranging from grouped coordinate minimization through simulated annealing to genetic algorithms. The most popular method, however, is a simple Picard iteration through the first-order conditions for the stationary points of (4), known as the fuzzy c-means (FCM) algorithm. The stationary points of the objective function (4) can be found by adjoining the constraint Σ_{i=1}^{c} μ_ik = 1, 1 ≤ k ≤ N, to J by means of Lagrange multipliers:

J̄(X; U, V, λ) = Σ_{i=1}^{c} Σ_{k=1}^{N} (μ_ik)^m D²_ikA + Σ_{k=1}^{N} λ_k (Σ_{i=1}^{c} μ_ik − 1)  (5)

Figure 1 shows the results of the K-medoid algorithm for four speakers' 6 s voice data with normalization.
By setting the gradients of J̄ with respect to U, V and λ to zero, it can be shown that, if D²_ikA > 0 for all i, k and m > 1, then (U, V) may minimize (4) only if

μ_ik = 1 / Σ_{j=1}^{c} (D_ikA / D_jkA)^{2/(m−1)}, 1 ≤ i ≤ c, 1 ≤ k ≤ N  (6)

and

v_i = Σ_{k=1}^{N} (μ_ik)^m x_k / Σ_{k=1}^{N} (μ_ik)^m, 1 ≤ i ≤ c  (7)

The remaining constraints are also fulfilled by this solution. Note that (7) gives v_i as the weighted mean of the data items belonging to a cluster, where the weights are the membership degrees.
The FCM algorithm uses the standard Euclidean distance norm, which induces hyperspherical clusters. It can therefore only detect clusters with the same shape and orientation, because the common choice of the norm-inducing matrix is A = I, or A can be chosen as a diagonal matrix that accounts for different variances in the directions of the coordinate axes of X: A = diag(1/σ_1², 1/σ_2², …, 1/σ_n²). Figure 2 shows the results of the Fuzzy C-means algorithm for four speakers' 6 s voice data with normalization.
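The FCM Picard iteration described above, alternating the membership update and the weighted-mean center update with the Euclidean norm (A = I), can be sketched as follows; the random initialization and iteration count are illustrative assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Fuzzy c-means sketch with the Euclidean norm: alternate the
    membership update and the weighted-mean center update."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(0)                      # memberships sum to 1 per point
    for _ in range(n_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(1, keepdims=True)   # centers: weighted means
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=-1) + 1e-12
        U = 1.0 / (D ** (2.0 / (m - 1)))          # membership update
        U /= U.sum(0)
    return U, V
```

The small constant added to D guards against a zero distance when a data point coincides exactly with a center.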

Application of Gustafson-Kessel algorithm
Gustafson and Kessel extended the standard fuzzy c-means algorithm by employing an adaptive distance norm, in order to detect clusters of different geometrical shapes in one data set [12].
Each cluster has its own norm-inducing matrix A_i, which is used in the c-means functional as an optimization variable, allowing the cluster to adapt the distance norm to the local topological structure of the data. Let A denote the c-tuple of norm-inducing matrices: A = (A_1, A_2, …, A_c). The objective functional of the GK algorithm is defined by:

J(X; U, V, A) = Σ_{i=1}^{c} Σ_{k=1}^{N} (μ_ik)^m D²_ikA_i  (9)

subject to the conditions μ_ik ∈ [0, 1], 1 ≤ i ≤ c, 1 ≤ k ≤ N; Σ_{i=1}^{c} μ_ik = 1, 1 ≤ k ≤ N; and 0 < Σ_{k=1}^{N} μ_ik < N, 1 ≤ i ≤ c, which can be applied directly for a fixed A. The objective function (9) cannot be minimized directly with respect to A_i, however, since it is linear in A_i: J could be made as small as desired simply by making A_i less positive definite. A_i must therefore be constrained in some way in order to obtain a feasible solution. The usual approach is to constrain its determinant: allowing the matrix A_i to vary with its determinant fixed corresponds to optimizing the shape of the cluster while its volume remains constant, ‖A_i‖ = ρ_i, ρ_i > 0, where ρ_i is fixed for each cluster.
Using the Lagrange multiplier method, the following expression for A_i is obtained:

A_i = [ρ_i det(F_i)]^{1/n} F_i^{−1}

where F_i is the fuzzy covariance matrix of the ith cluster, defined by:

F_i = Σ_{k=1}^{N} (μ_ik)^m (x_k − v_i)(x_k − v_i)^T / Σ_{k=1}^{N} (μ_ik)^m  (11)

Note that substituting A_i = [ρ_i det(F_i)]^{1/n} F_i^{−1} and (11) into (9) yields a generalized squared Mahalanobis distance norm between x_k and the cluster mean v_i, where the covariance is weighted by the membership degrees. Figure 3 shows the results of the Gustafson-Kessel clustering algorithm for four speakers' 6 s voice data with normalization.
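One evaluation of the Gustafson-Kessel adaptive distance can be sketched as follows: compute the fuzzy covariance F_i of each cluster, form the norm-inducing matrix A_i with fixed determinant, and evaluate the squared distances of every point to every center. This is a single step of the full GK iteration, with names chosen for illustration.

```python
import numpy as np

def gk_distances(X, V, U, m=2.0, rho=1.0):
    """One Gustafson-Kessel distance evaluation: fuzzy covariance per
    cluster, norm-inducing matrix A_i = (rho*det(F_i))**(1/n) * inv(F_i),
    then the squared adaptive distance of every point to every center."""
    c, n = V.shape
    Um = U ** m
    D2 = np.empty((c, X.shape[0]))
    for i in range(c):
        diff = X - V[i]
        # fuzzy covariance: membership-weighted outer products
        F = (Um[i][:, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(0) / Um[i].sum()
        A = (rho * np.linalg.det(F)) ** (1.0 / n) * np.linalg.inv(F)
        # squared Mahalanobis-like distance diff^T A diff per point
        D2[i] = np.einsum('kj,jl,kl->k', diff, A, diff)
    return D2
```

The determinant scaling keeps each cluster's "volume" fixed while letting its shape and orientation adapt to the data.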

Application of Gath-Geva algorithm
The fuzzy maximum likelihood (FML) clustering algorithm employs a distance norm based on the fuzzy maximum likelihood estimates, as proposed by Bezdek and Dunn [13]:

D_ik(x_k, v_i) = ([det(F_wi)]^{1/2} / α_i) exp((1/2)(x_k − v_i)^T F_wi^{−1} (x_k − v_i))

Note that, contrary to the Gustafson-Kessel algorithm, this distance norm involves an exponential term and therefore decreases faster than the inner-product norm. F_wi denotes the fuzzy covariance matrix of the ith cluster, given by:

F_wi = Σ_{k=1}^{N} (μ_ik)^w (x_k − v_i)(x_k − v_i)^T / Σ_{k=1}^{N} (μ_ik)^w

where w = 1 in the original FML algorithm; here the weighting exponent w = 2 is used to make the partition fuzzier and thereby compensate for the exponential character of the distance norm. The difference between the matrix F_i in the Gustafson-Kessel algorithm and the F_wi above is that the latter does not involve the weighting exponent m but uses w instead. This is because the two weighted covariance matrices arise as generalizations of the classical covariance from two different concepts. α_i is the prior probability of selecting cluster i, given by:

α_i = (1/N) Σ_{k=1}^{N} μ_ik

The membership degrees μ_ik are interpreted as the posterior probabilities of selecting the ith cluster given data point x_k. Gath and Geva [12] reported that the fuzzy maximum likelihood estimates clustering algorithm is able to detect clusters of varying shapes, sizes and densities. The cluster covariance matrix is used in conjunction with an "exponential" distance, and the clusters are not constrained in volume. This algorithm is less robust, however, in the sense that it needs a good initialization, since due to the exponential distance norm it converges to a nearby local optimum. Figure 4 shows the results of the Gath-Geva clustering algorithm for three speakers' 6 s voice data. The partitions were evaluated with the following validity measures:
-Partition Index (SC): the ratio of cluster compactness to cluster separation. It is the sum of individual cluster validity measures, normalized by dividing by the fuzzy cardinality of each cluster [14]: SC(c) = Σ_{i=1}^{c} [Σ_{k=1}^{N} (μ_ik)^m ‖x_k − v_i‖² / (N_i Σ_{j=1}^{c} ‖v_j − v_i‖²)]. SC is useful when comparing different partitions with the same number of clusters.
A lower value of SC indicates a better partition.
-Separation Index (S): in contrast to the partition index (SC), the separation index uses a minimum-distance separation for partition validity [14]: S(c) = Σ_{i=1}^{c} Σ_{k=1}^{N} (μ_ik)² ‖x_k − v_i‖² / (N min_{i≠j} ‖v_j − v_i‖²).
-Xie and Beni's Index (XB): quantifies the ratio of the total within-cluster variation to the separation of the clusters: XB(c) = Σ_{i=1}^{c} Σ_{k=1}^{N} (μ_ik)^m ‖x_k − v_i‖² / (N min_{i≠j} ‖v_i − v_j‖²). The optimal number of clusters should minimize the value of this index.
-Dunn's Index (DI): this index was originally proposed for the identification of "compact and well separated clusters," so the result of the clustering has to be recomputed as if it were a hard partition: DI(c) = min_{i} {min_{j≠i} {min_{x∈C_i, y∈C_j} d(x, y) / max_{k} max_{x,y∈C_k} d(x, y)}}.
-Alternative Dunn Index (ADI): the aim of modifying the original Dunn index was to simplify the calculation of the dissimilarity between two clusters.
Note that the only difference among SC, S and XB is the measure of cluster separation. In the case of overlapped clusters, the values of DI and ADI are not reliable because of the re-partitioning of the results with the hard-partition method. Figure 5 shows the results of the Gustafson-Kessel clustering algorithm for three speakers' 6 s voice data. Table 1 summarizes the validity measures obtained for the clustering algorithms.
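As a concrete example of one of these measures, the classic Xie-Beni index can be computed as below (the center-to-center separation term follows the original Xie-Beni definition; the toolbox variant cited in [14] may differ in the separation term):

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """Xie-Beni index: total fuzzy within-cluster variation divided by
    N times the minimum squared distance between any two cluster centers.
    X: (N, n) data, U: (c, N) memberships, V: (c, n) centers."""
    D2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)      # (c, N)
    compactness = ((U ** m) * D2).sum()
    sep = min(((V[i] - V[j]) ** 2).sum()
              for i in range(len(V)) for j in range(len(V)) if i != j)
    return compactness / (X.shape[0] * sep)
```

A smaller value indicates compact clusters whose centers are well separated, which is why the optimal cluster count minimizes this index.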

KERNEL-BASED SPEAKER SPECIFIC FEATURE EXTRACTION AND ITS APPLICATION
Classification algorithms require the objects to be classified to be represented as points in a multidimensional feature space. However, before running the learning algorithm one can apply further vector-space transformations to the initial features. There are two reasons for doing this: first, such transformations can improve classification performance, and second, they can reduce the dimensionality of the data. The selection of the initial features and their transformation are sometimes treated together in the literature under the title "feature extraction". To avoid misunderstanding, this section deals only with the latter, assuming that the initial feature set is already given; the aim is a representation that is more effective and allows faster classification. The approach to feature extraction may be either linear or nonlinear, but there is a technique that in some sense breaks down the barrier between the two forms.
The key idea behind the kernel technique was originally presented in [15] and was applied again in connection with the general-purpose support vector machine (SVM) [16-18], followed by other kernel-based methods.

Supplying input variable information into kernel PCA
Additional information can be supplied to the KPCA representation to aid interpretability. We have developed a procedure to project a given input variable into the subspace spanned by the feature-space eigenvectors ṽ = Σ_{i=1}^{N} α̃_i φ(x_i).
We can think of an observation as a realization of a random vector x = (x_1, x_2, …, x_n) and then represent the prominence of the kth input variable in the KPCA. Consider the set of points of the form z = x + s·e_k ∈ R^n, where the components of e_k = (0, …, 1, …, 0)^T are 0 except for the kth, which is 1. The projections φ(z) of the images of these points onto the subspace spanned by the feature vector ṽ = Σ_{i=1}^{N} α̃_i φ(x_i) can then be calculated. Considering (12), the resulting row vector describes the induced curve in the eigenspace, expressed in matrix form in (13). Furthermore, by projecting the tangent vector at s = 0, we can express the direction of maximum change of the projection associated with the variable x_k; the matrix form of this expression is given in (14), where the training point z = x_k. Thus, by applying (13), any given input variable can be represented locally in the KPCA plot, and by using (14) the tangent vector associated with any given input variable can be represented at each sample point [19]. A vector field indicating the direction of growth of a given variable can therefore be drawn on the KPCA plot. Some techniques exist to compute z for specific kernels [20]. For a Gaussian kernel k(x, y) = exp(−‖x − y‖²/2σ²), z must satisfy the following condition.
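A minimal Gaussian-kernel PCA sketch of the projection step discussed above (the kernel centering and eigenvector normalization follow the standard KPCA formulation; the bandwidth value and function name are assumptions):

```python
import numpy as np

def kernel_pca(X, n_components, sigma=1.0):
    """Gaussian-kernel PCA sketch: build the kernel matrix, double-center
    it in feature space, eigendecompose, and return the projections of the
    training points onto the leading components."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    N = len(X)
    J = np.eye(N) - np.ones((N, N)) / N
    Kc = J @ K @ J                          # centering in feature space
    w, v = np.linalg.eigh(Kc)               # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:n_components]
    alphas = v[:, idx] / np.sqrt(np.maximum(w[idx], 1e-12))
    return Kc @ alphas                      # projections onto leading axes
```

Projecting a perturbed point z = x + s·e_k with the same expansion coefficients is what traces the induced curve of an input variable in this eigenspace.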

EXPERIMENTAL SETUP
To evaluate the efficiency of kernel-based speaker-specific feature extraction techniques, a language-independent utterance recognition experiment was performed. The experiment uses 520 Japanese words from set C of the ATR Japanese voice database, spoken by 80 speakers (40 male and 40 female). Audio samples of 10 iTaukei speakers were collected at random and under unfavourable conditions. The average duration of the training samples was six seconds per speaker for all 10 speakers, and out of twenty utterances per speaker only one was used for training [21]. The remaining 19 voice samples from the corpus were used for matching. We recorded the utterances for this investigation in one sitting [22].
Speech features, each consisting of 24 Mel-frequency cepstral coefficients (MFCCs), were extracted from these utterances for every 20-ms Hamming-windowed frame with 10-ms frame shifts. Prior to MFCC computation, voice activity detection was applied to remove salient non-speech regions that may be included in an utterance [23, 24]. Our basic strategy is to create an utterance-independent GMM using all the utterances to be clustered, followed by an adaptation of the utterance-independent GMM to each of the utterances using maximum a posteriori (MAP) estimation [25]. This technique derives from the GMM-UBM strategy [26] for FSR, in which the required speaker-specific models are made by tuning the parameters of a universal speaker model pre-trained on speech data from numerous speakers. We use language-independent Gaussian mixture modeling followed by MAP adaptation in language-independent forensic speaker recognition.
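The MAP adaptation step can be sketched for the common mean-only relevance-MAP variant of the GMM-UBM approach; the relevance factor r, the diagonal-covariance simplification, and all names are assumptions, since the paper does not specify these details.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_vars, frames, r=16.0):
    """Mean-only relevance-MAP adaptation sketch (diagonal-covariance GMM):
    soft-count each frame against the UBM, then shift each mixture mean
    toward the data in proportion to its occupancy count."""
    # per-frame, per-mixture log-likelihoods (up to a constant)
    diff2 = (frames[:, None, :] - ubm_means[None]) ** 2
    ll = -0.5 * (diff2 / ubm_vars[None] + np.log(2 * np.pi * ubm_vars)[None]).sum(-1)
    ll += np.log(ubm_weights)[None]
    # responsibilities via a numerically stable softmax over mixtures
    g = np.exp(ll - ll.max(1, keepdims=True))
    g /= g.sum(1, keepdims=True)
    n = g.sum(0)                                   # occupancy counts
    ex = (g.T @ frames) / np.maximum(n[:, None], 1e-10)
    alpha = (n / (n + r))[:, None]                 # data-dependent weight
    return alpha * ex + (1 - alpha) * ubm_means
```

Mixtures that see little data keep their UBM means (alpha near 0), while well-observed mixtures move toward the utterance statistics.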
Throughout the experiment, 10,400 utterances were used as training data and the remaining 31,200 utterances were used as test data. The sampling rate of the audio signal is 16 kHz, with 13 Mel-cepstral coefficients extracted using 25.6-ms Hamming windows with 10-ms shifts. Figure 6 shows the equal error rate (EER) of the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms for 6 s of voice data of the ATR Japanese C language set. The forensic speaker recognition efficiencies of the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms are 95.2%, 97.3%, 98.5% and 99.7%, and the EERs are 3.62%, 2.91%, 2.82% and 2.61%, respectively. The EER improvement of the Gath-Geva based FSR system compared with Gustafson-Kessel and Fuzzy C-means is 8.04% and 11.49%, respectively. Table 2 lists the efficiency and EER of the FSR system for the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms on the ATR Japanese C language set. Figure 7 shows the EER of the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms for 6 s of voice data of the 10 iTaukei speakers' cross language. The FSR efficiencies of the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms are 93.2%, 96.6%, 97.7% and 98.8%, and the EERs are 4.23%, 3.42%, 3.33% and 3.11%, respectively. The EER improvement of the Gath-Geva based FSR system compared with Gustafson-Kessel and Fuzzy C-means is 7.07% and 9.96%, respectively. Table 3 lists the efficiency and EER of the FSR system for the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms for the 10 iTaukei speakers' cross language.
In addition, to decide how many clusters should be produced, the clustering method was integrated with the Bayesian information criterion. Experimental results show that the automatically determined number of clusters can approximate the actual speaker population size. The experimental evaluation of the forensic recognition system gives efficiencies of 95.2%, 97.3%, 98.5% and 99.7% for the K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva clustering algorithms, with EERs of 3.62%, 2.91%, 2.82% and 2.61%, respectively. The EER improvement of the Gath-Geva based FSR system compared with Gustafson-Kessel and Fuzzy C-means is 8.04% and 11.49%, respectively.

Figure 7. Equal error rate of K-medoid, Fuzzy C-means, Gustafson-Kessel and Gath-Geva for 6 sec of voice data of iTaukei speakers' cross language

CONCLUSION
This study examined methods for improving the measurement of inter-utterance similarity for speaker clustering. The similarity relationships between the utterances to be clustered can be exploited more easily and efficiently by using a voice-characteristic reference space. We presented several implementations for constructing reference spaces. In particular, in order to capture the most representative characteristics of the speakers' voices, the reference space was represented as a set of eigenvectors obtained by applying the eigenvoice technique to the set of utterances to be clustered. This resulted in a significant improvement in speaker-clustering performance compared with the traditional inter-utterance similarity measure based on the generalized likelihood ratio. Furthermore, we investigated a method for creating clusters beyond traditional hierarchical clustering so that, as far as possible, all within-cluster utterances come from a single speaker. This criterion was formulated as a problem of estimating and optimizing the overall cluster purity. By expressing cluster purity as a function of inter-utterance similarity and applying a genetic algorithm to solve this problem, we demonstrated a further increase in speaker-clustering efficiency compared with conventional agglomerative hierarchical clustering.