Combination of texture feature extraction and forward selection for one-class support vector machine improvement in self-portrait classification

This study aims to validate self-portraits using a one-class support vector machine (OCSVM). To validate accurately, we build a model by combining two texture feature extraction methods, Haralick and local binary pattern (LBP). We also reduce irrelevant features using forward selection (FS). OCSVM was selected because it can solve the problem caused by the inadequate variation of the negative class population. In OCSVM, we only need to feed the algorithm the true-class data, and data whose pattern does not match will be classified as false. However, combining the two feature extractions produces many features, leading to the curse of dimensionality. The FS method is used to overcome this problem by selecting the best features. From the experiments carried out, the Haralick+LBP+FS+OCSVM model outperformed the other models with an accuracy of 95.25% on validation data and 91.75% on test data.


INTRODUCTION
Currently, identity verification is performed on most types of digital transactions. Verification is done by uploading identity documents such as ID cards, passports, self-portraits, and others. One of the critical pieces of data in the identity verification process is a self-portrait. In Indonesia, self-portraits are still widely used as proof of a person's identity, such as registration for selecting prospective civil servants (CPNS), joint selection to enter state universities (SBMPTN), or e-commerce account verification. The accepted self-portrait generally has specific terms and conditions such as shooting position, background, and photo quality.
However, in reality, any form of self-portrait can be uploaded even if it does not meet the predetermined standards, thus reducing the efficiency of identity verification. This happens because there is no self-portrait validation process. Self-portrait validation is needed to facilitate the process of verifying faces from identity documents against faces captured in real time. Studies on selfie portraits and photo documents have been carried out by [1]-[4]. Most of them use the convolutional neural network (CNN) approach. CNN delivers good performance, but some works still adopt several other algorithms to get optimal results. Moreover, CNN requires significant computational resources.
This process can be simplified by involving efficient machine learning algorithms. One such algorithm is the one-class support vector machine (OCSVM). This algorithm suits the problem at hand because the negative class cannot be adequately represented, given the many variations of self-portrait errors. In OCSVM, these error variations are classified as anomalies, as has been done by [5]-[8]. OCSVM was proposed by [9] and is an adaptation of the support vector machine (SVM) methodology. Like basic SVM, OCSVM involves kernel functions to perform classification, including linear, polynomial, radial basis function (RBF), and sigmoid. A comparison of OCSVM kernels was carried out by [10]; the results show that RBF consistently provides the best performance.
In performing self-portrait classification, a feature extraction step is needed. Feature extraction generates features that are used to describe the content [11]. Image feature extraction is divided into several categories, namely color, texture, and shape feature extraction. Texture is a key element of human visual perception widely used in computer vision systems [12]. Examples of texture feature extraction methods are Haralick and local binary patterns. In the studies [12]-[14], Haralick feature extraction resulted in excellent classification performance, as did the use of local binary pattern (LBP) in [15]-[17]. In addition, Kaplan et al. [18] and Porebski et al. [19] state that Haralick and local binary pattern features are efficient texture descriptors for predicting sample variation.
While they deliver good performance, Haralick and LBP produce a large number of features. The large number of features tends to reduce the prediction accuracy of the classification model [20]. This problem can be overcome by minimizing dimensions or irrelevant features, one of which is feature selection. Feature selection can improve machine learning model performance and reduce computation time [21], [22]. In addition, forward selection (FS) is utilized to perform feature selection and results in improved classification performance [22]- [25].
This study establishes a model based on OCSVM and applies it to simulate self-portrait validation. In addition, this research is intended to improve OCSVM performance by combining Haralick and local binary pattern features and then reducing irrelevant features using forward selection. This paper is structured as follows: the second section describes the procedures and methodologies applied in our study; the third section covers the results of our experiments, evaluation, and analysis; and the last section highlights and discusses the main findings of this study.

METHOD
Figure 1 presents a block diagram of the research procedure in a step-by-step manner. Each sub-stage is discussed in detail below.

Dataset
This study uses student self-portrait datasets taken from the Lambung Mangkurat University Academic Portal application. Self-portraits are divided into two labels, namely true and false self-portraits. A self-portrait is considered correct if the picture is taken from the front, the position of the object is symmetrical, the background is monotonous, and the image quality is clear.

Feature extraction
Before classification, the dataset was extracted using texture feature extraction. The texture feature extraction used is Haralick and local binary pattern. In this study, feature extraction was carried out without pre-processing. The feature extraction process is implemented in Python and the Mahotas library.

Haralick
Haralick feature extraction was proposed by Haralick et al. [26]. The Haralick extraction result consists of 14 features calculated from the gray level co-occurrence matrix (GLCM). This method calculates each feature value over the 4 GLCM angles and takes the average. The GLCM is generated from the co-occurrence probability of two pixel values (i, j) at distance d, angle θ (0°, 45°, 90°, and 135°), and gray level N. The formulas can be seen in (1)-(14); the features include, among others:
− Inverse difference moment
− Sum average
− Sum entropy
− Information measure of correlation 2
− Maximum correlation coefficient, the square root of the second-largest eigenvalue of Q
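The paper performs this extraction with the Mahotas library; purely as an illustration of the principle, the following NumPy sketch builds a one-directional GLCM for a single angle and computes one of the 14 features, the inverse difference moment. The toy image, level count, and one-directional counting (Haralick's original matrix is symmetric) are simplifying assumptions.

```python
import numpy as np

def glcm(img, dx, dy, levels):
    """Normalised gray level co-occurrence matrix P(i, j) counting pixel
    pairs at offset (dx, dy) -- i.e. one distance/angle combination.
    One-directional for brevity; Haralick's original GLCM is symmetric."""
    P = np.zeros((levels, levels))
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < h and 0 <= x2 < w:
                P[img[y, x], img[y2, x2]] += 1
    return P / P.sum()

def inverse_difference_moment(P):
    """One of the 14 Haralick features: sum of P(i, j) / (1 + (i - j)^2)."""
    i, j = np.indices(P.shape)
    return np.sum(P / (1.0 + (i - j) ** 2))

# Tiny 3-level toy image; a real pipeline would quantise a grayscale photo.
img = np.array([[0, 0, 1],
                [0, 1, 1],
                [2, 2, 2]])
P0 = glcm(img, dx=1, dy=0, levels=3)   # angle 0 degrees, distance d=1
idm = inverse_difference_moment(P0)    # repeat for 45/90/135 and average
```

In the full method, each feature is computed for all four angles and the average is taken, as described above.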

Local binary pattern
LBP feature extraction was first proposed by Ojala et al. [27]. LBP is calculated from the grayscale image. The LBP calculation begins with the localization of pixels determined from the sampling points p on a circle of radius r, as shown in Figure 2. Meanwhile, the principle of calculating LBP is shown in Figure 3. This method compares the value of the center pixel with the values of the pixels around it. If the intensity of a neighboring pixel is greater than or equal to that of the center pixel, its value is set to 1; otherwise, it is set to 0. The thresholded values are multiplied by the weight of each pixel position, and the sum of these products is the LBP value.
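The threshold-and-weight principle above can be sketched for a single 3x3 neighborhood (p=8, r=1). The patch values and the clockwise bit ordering here are illustrative choices, not taken from the paper; in practice a library such as Mahotas computes these codes for every pixel and aggregates them into a histogram.

```python
import numpy as np

def lbp_code(patch):
    """LBP code of a 3x3 patch's center pixel: threshold the 8 neighbors
    against the center (>= gives 1, else 0), then weight the bits by
    powers of two and sum."""
    center = patch[1, 1]
    # Neighbors in clockwise order starting from the top-left corner.
    coords = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = [1 if patch[r, c] >= center else 0 for r, c in coords]
    return sum(b * (2 ** i) for i, b in enumerate(bits))

patch = np.array([[6, 5, 2],
                  [7, 6, 1],
                  [9, 8, 7]])
code = lbp_code(patch)  # an integer in 0..255 for p=8
```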

Combining extraction results
After feature extraction, the next step is to combine the Haralick extraction results with the LBP extraction results into a single feature set.

Data mining
Data mining modeling is done using Rapidminer. Classification is performed on the datasets generated in the previous step. Each dataset is classified with OCSVM, with and without forward selection.

One class support vector machine
Based on the research proposed by [9], OCSVM aims to find the best hyperplane separating the target data from the origin (second class). The position of the hyperplane is affected by the ν (nu) parameter. OCSVM uses the kernel trick to map data into a high-dimensional space. Kernels used in OCSVM include linear, polynomial, RBF, and sigmoid; this research uses the RBF kernel. The OCSVM is formulated in (15):

min_{ω, ξ, ρ} (1/2)‖ω‖² + (1/(νn)) Σᵢ₌₁ⁿ ξᵢ − ρ, subject to ω·Φ(xᵢ) ≥ ρ − ξᵢ, ξᵢ ≥ 0, (15)

where n is the number of data points, ν is the regularization parameter, ξᵢ is the slack variable for point xᵢ that allows it to lie outside the decision boundary, and ω and ρ are the parameters that determine the decision boundary.
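The paper trains OCSVM in RapidMiner; as an equivalent sketch, scikit-learn's OneClassSVM implements the same formulation. The synthetic data, nu value, and test points below are illustrative assumptions, not the paper's settings: the model is fit on true-class samples only, and predict returns +1 for inliers and -1 for anomalies.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Stand-in for the true-class feature vectors (200 points, 2 features).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# RBF kernel as in the paper; nu bounds the fraction of training
# outliers and support vectors (value here is illustrative).
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
model.fit(X_train)

inlier = model.predict([[0.0, 0.0]])     # near the training cloud -> +1
outlier = model.predict([[10.0, 10.0]])  # far from the cloud -> -1
```

A smaller nu yields a tighter boundary around the training data, which mirrors the strict-hyperplane behavior discussed in the evaluation.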

Forward selection
Forward selection selects the most influential attributes and removes irrelevant ones. It starts from an empty model, then attributes are added one by one until certain stopping criteria are met.
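The greedy procedure described above can be sketched as follows. The scoring function and the toy "usefulness" weights are hypothetical placeholders; in the actual pipeline the score of a candidate subset would be the OCSVM validation accuracy computed by RapidMiner.

```python
def forward_selection(features, score_fn, min_gain=0.0):
    """Greedy forward selection: start from an empty set and repeatedly
    add the single feature that most improves score_fn; stop when no
    candidate improves the current score by more than min_gain."""
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining:
        score, feat = max((score_fn(selected + [f]), f) for f in remaining)
        if score <= best_score + min_gain:
            break
        selected.append(feat)
        remaining.remove(feat)
        best_score = score
    return selected

# Hypothetical per-feature usefulness minus a subset-size penalty,
# standing in for model validation accuracy.
usefulness = {"a": 3, "b": 2}
score = lambda subset: sum(usefulness.get(f, 0) for f in subset) - len(subset)
chosen = forward_selection(["a", "b", "c"], score)
```

Here the irrelevant feature "c" is never added because including it lowers the score, illustrating how FS trims the combined Haralick+LBP feature set.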

Evaluation
The performance of each model is evaluated based on its accuracy. After the performance of each model is obtained, the next step is to compare the accuracy of the proposed model (Haralick+LBP+FS+OCSVM) with that of the other models to prove that the proposed model provides increased accuracy.
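Concretely, accuracy is the fraction of predictions matching the ground-truth labels. A minimal sketch, using OCSVM-style +1/-1 labels (the label values and vectors here are illustrative only):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# +1 = valid self-portrait, -1 = invalid (illustrative labels only).
y_true = [1, 1, 1, -1]
y_pred = [1, 1, -1, -1]
acc = accuracy(y_true, y_pred)  # 3 of 4 correct -> 0.75
```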

RESULTS AND DISCUSSION

Dataset
Student self-portraits are obtained from the ULM Student Academic Portal application. The dataset contains 59,000 photos from the student intake between the years 2013 and 2020. Of these, we took 2,400 photos labeled true and false. An example of data labeling can be seen in Table 1. After being labeled, the data is then divided into 3 categories, namely training, validation, and testing data. Details of the amount of data can be seen in Table 2.
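A three-way split like the one described can be sketched as below. The 60/20/20 fractions and the seed are placeholder assumptions; the paper's actual partition sizes are those listed in Table 2.

```python
import random

def split_dataset(items, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle and split items into train/validation/test partitions.
    The remaining (1 - train_frac - val_frac) share becomes test data."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train, val, test = split_dataset(range(2400))  # 2,400 labeled photos
```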

Feature extraction
After the photo data is obtained, the next step is to perform feature extraction with Haralick and LBP. To perform this extraction, we use the Python programming language with the Mahotas library. In the Haralick feature extraction, we take all 14 features. The results of the Haralick extraction can be seen in Table 3. Meanwhile, the LBP feature extraction used p=8 sampling points and radius r=1, which resulted in 36 features. The results of the LBP feature extraction can be seen in Table 4. These tables show example results from some of the selected images.

Combining extraction results
This step is to form a new dataset by combining the data extracted by Haralick with the results of the LBP extraction. The number of features in the combined dataset is 50 features, consisting of 14 Haralick features and 36 LBP features. From here, there are 3 datasets that are ready to be entered into the next step, namely: Haralick, LBP, and Haralick+LBP.
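The combination step is a column-wise concatenation of the two feature matrices. The random matrices below are placeholders standing in for the real extraction outputs (one row per image):

```python
import numpy as np

# Placeholder matrices standing in for the real extraction results:
# 2,400 images, 14 Haralick columns and 36 LBP columns.
haralick_feats = np.random.rand(2400, 14)
lbp_feats = np.random.rand(2400, 36)

# Column-wise concatenation yields the combined 50-feature dataset.
combined = np.hstack([haralick_feats, lbp_feats])
```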

Data mining
We created the schema model using Rapidminer, as shown in Figure 4. Based on this scheme, the best accuracy of each model is obtained from the FS-selected attributes and the optimal OCSVM hyper-parameters. Details can be seen in Table 5.

Evaluation
After obtaining the optimum hyper-parameters, our proposed model is re-examined on the test data that has been prepared. The results can be seen in Table 6. The table shows that the proposed model consistently produces the best performance. Several misclassifications occurred because some test images had texture patterns similar to the training data, even though we had labeled them as the false class: their photo attributes, such as dimensions, color, and object position, did not match the requirements. These misclassifications are understandable because Haralick and LBP texture feature extraction is not affected by these three attributes. On the other hand, some true-class images were misclassified as false because the hyperplane (boundary) in OCSVM is affected by the Nu parameter, which controls how many outliers are allowed. In this study, the optimum Nu value was small, producing a strict hyperplane that rejected some true-class data. Table 7 shows examples of the misclassification results.

CONCLUSION
The experiments have proven that the proposed model, which combines Haralick and LBP feature extraction with forward selection and the OCSVM algorithm (Haralick+LBP+FS+OCSVM), outperforms the other models with an accuracy of 95.25% on validation data and 91.75% on test data. Thus, we conclude that combining the results of Haralick and LBP feature extraction, selected with forward selection, can improve the performance of OCSVM in classifying self-portraits. For further research, because profile photos require precise dimensions, color, and object position, it is necessary to use methods that are not invariant to these three factors.