http://ijece.iaescore.com Identification of important features and data mining classification techniques in predicting employee absenteeism at

Info 2021 Employee absenteeism at work costs organizations billions a year. Predicting employees' absenteeism and the reasons behind their absence helps organizations reduce expenses and increase productivity. Data mining turns the vast volume of human resources data into information that can help in decision-making and prediction. Although feature selection is a critical step in data mining for enhancing the efficiency of the final prediction, it is not yet known which feature selection method is better. Therefore, this paper aims to compare the performance of three well-known feature selection methods in absenteeism prediction: relief-based feature selection, correlation-based feature selection, and information-gain feature selection. In addition, this paper aims to find the best combination of feature selection method and data mining technique for enhancing absenteeism prediction accuracy. Seven classification techniques were used as prediction models. Additionally, a cross-validation approach was used to assess the applied prediction models in order to obtain more realistic and reliable results. The dataset used was built at a courier company in Brazil from records of absenteeism at work. In the experimental results, correlation-based feature selection surpasses the other methods across the performance measurements. Furthermore, the bagging classifier was the best-performing data mining technique when features were selected using correlation-based feature selection with an accuracy rate of


INTRODUCTION
Absenteeism at work can be described as a "habitual pattern of absence from a duty or obligation" [1]. This happens when employees do not show up or engage in events related either directly or indirectly to their jobs [2]. In general, absenteeism is believed to be a main indicator of poor performance [3]. Unpredicted absenteeism causes extra workload for other staff and reduces work efficiency. It may also result in low productivity and high direct and indirect costs [4]. It is therefore necessary for organizations that are heavily dependent on human resources to develop and implement an absenteeism-prediction mechanism that helps managers take preventative actions against the absence of employees and reduce financial costs [2], [5]. Organizations usually collect data about employees that could be used to improve decision-making processes. Data mining provides a platform for predicting, analyzing, and grouping problems of various kinds without a subject matter expert. Data mining techniques are now being steadily adopted in different domains. One domain that still needs the application of data mining is human resource management. Even though patterns in human behavior are difficult to analyze, data mining techniques help us identify hidden and interesting patterns [6].
An in-depth analysis of the huge amount of data collected by organizations would take a long time and require a lot of human capital. When there is an abundance of irrelevant data, it is unlikely to be readily understood and absorbed [7]. Consequently, a very important issue in predicting human behavior, especially in predicting absenteeism at work, is how to filter and summarize a huge volume of data. As a preprocessing step, feature selection is one of the most essential steps in the data mining process. It aims at filtering irrelevant and redundant features out of the original data [8], [9]. Irrelevant and redundant data fed into a model can consume a great deal of cost and time and even reduce model accuracy [10], [11].
While there are many feature selection methods that might be used for absenteeism prediction, the paper's first research question is: what is the best method for enabling the prediction models to deliver the best performance? The second research question is: what is the best combination of feature selection method and data mining technique to enhance absenteeism prediction accuracy? In this paper, we consider three feature selection methods and compare their prediction accuracy in absenteeism prediction, namely, relief-based feature selection, correlation-based feature selection, and information-gain feature selection. To find the best combination of feature selection method and data mining technique for enhancing absenteeism prediction accuracy, seven classification techniques were used as prediction models, namely, naive Bayes, logistic regression, multilayer perceptron, k-nearest neighbor, bagging, J48, and random forest. Additionally, a cross-validation (CV) approach was used to evaluate the applied prediction models in order to obtain more realistic and reliable results. The dataset used was developed at a courier company in Brazil from records of absenteeism at work.
In previous works, researchers attempted to use different data mining techniques or merge different models to deal with the absenteeism prediction problem. Martiniano et al. [12] built a neuro-fuzzy network utilizing the error backpropagation algorithm with a multilayer perceptron for predicting absenteeism at work. Another study was conducted by Nunung et al. to develop a decision tree classifier to discover the common features of employees who were regularly absent from the workplace [13]. In a recent study, Gayathri used naive Bayes, multilayer perceptron, and J48 classifiers to create a classification model to predict employee absenteeism for a short or long period of time [14]. In Ferreira et al. [15], the researchers used artificial neural networks. In similar research [1], the authors suggested the use of neural networks and deep learning algorithms to predict employees' behaviors regarding adherence at their workplace.
The literature shows a lack of studies that have applied data mining techniques to the prediction of absenteeism at work and indicates that the critical process of feature selection is not carefully considered. According to Dogruyol et al., no research has focused on finding appropriate methods to compare and evaluate the performance of different data mining classification techniques while using particular combinations of features [4]. Thus, a thorough analysis is needed to test various feature selection methods and to identify the most relevant features and the data mining classification techniques that will enhance prediction performance and ensure that the results are accurate and acceptable.
The contributions of this paper can be summarized as follows: helping organizations find the main reasons behind employees' absence in order to reduce expenses and increase productivity, improving prediction accuracy, identifying the best feature selection method for efficient prediction of absenteeism, and identifying a baseline feature selection method for relevant future research. The outline of the paper is as follows: Section 2 describes the research method followed by this paper. Sections 3 and 4 present the experimental results and discussion. The conclusion of the research is outlined in Section 5.

RESEARCH METHOD
The primary objective of this research is to compare three well-known feature selection methods used in absenteeism prediction, to examine their prediction performance, and to find the best combination of feature selection method and data mining technique for enhancing absenteeism prediction accuracy. Two research questions are answered by this study:
- RQ1: What are the important feature selection algorithms for predicting the absenteeism of employees at work?
- RQ2: What is the best combination of feature selection method and data mining technique for enhancing absenteeism prediction accuracy?
To achieve the research goal and address the abovementioned research questions, we began the data mining process with data preprocessing, followed by feature selection and then classification modeling. Classification modeling was repeated for the combination of attributes selected by each feature selection method. During each iteration, the output of each developed model was documented, depending on the selected features and the data mining technique, and the results were presented after the full process had been completed. In this study we used 10-fold cross-validation because the experimental results of previous studies suggest that the optimum number of folds is 10, as it balances the time required to perform the test against the bias associated with the validation process [16].
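The 10-fold procedure mentioned above can be sketched as follows. This is a minimal illustration only, not the fold assignment used in the experiments: the function name `k_fold_indices` and the fixed seed are assumptions, and the sketch does not stratify by class, which the tool used in the study may do.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; no stratification is applied.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread samples as evenly as possible: the first n % k folds get one extra.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 740 is the number of records reported for the dataset.
folds = list(k_fold_indices(740, k=10))
```

Each record appears in exactly one test fold, so every model is evaluated on data it was not trained on, and the ten per-fold scores can then be averaged.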

Dataset description
The dataset used in this research consists of records of employees' absenteeism at a courier company in Brazil. These records were collected from July 2007 to July 2010 and were later made available at the UCI machine learning repository [12]. The dataset consists of 740 records with 21 attributes. The attributes and their descriptions are shown in Table 1.

Data preprocessing
Data preprocessing is a significant step in data mining applications. Data collection methods are often poorly regulated, resulting in missing values, out-of-range values, and unlikely data combinations. These problems can result in misleading findings if they are not examined at the beginning of the data mining process. Therefore, the representation and quality of the data should be addressed before the analysis is carried out [17]. In practice, most of the time spent creating data mining applications goes to data preprocessing [18].
The absenteeism-at-work dataset was structured and had no missing values. However, through careful examination of the dataset, we found that dimension reduction and grouping were needed for some attributes. Since the attribute ID does not influence absenteeism, we removed it from the attribute list. As a result, twenty attributes remained after the data was cleaned. We noticed that the values of some attributes in the dataset were scattered, which would make good prediction results difficult to obtain. We found that the solution lay in grouping the values of certain attributes in order to improve prediction results. Grouping values is necessary when the number of distinct values of an attribute is too high, since dealing with each value individually can lead to problems during computation and interpretation [19]. Table 2 illustrates the grouping categories and values of some attributes as presented in [11].
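As a rough illustration of value grouping, the sketch below maps a scattered numeric attribute onto a few labeled intervals. The cut points, labels, and the helper name `group_value` are hypothetical and chosen only for the example; they are not the grouping categories of Table 2.

```python
from bisect import bisect_right

# Hypothetical cut points and labels, chosen only for illustration;
# they are NOT the grouping categories of Table 2.
HYPOTHETICAL_EDGES = [1, 9, 41]
HYPOTHETICAL_LABELS = ["none", "short", "medium", "long"]

def group_value(value, edges=HYPOTHETICAL_EDGES, labels=HYPOTHETICAL_LABELS):
    """Map a numeric value to the label of the interval it falls into."""
    # bisect_right counts how many cut points are <= value,
    # which is exactly the index of the matching interval label.
    return labels[bisect_right(edges, value)]

hours = [0, 2, 8, 16, 40, 120]
grouped = [group_value(h) for h in hours]
```

With these assumed cut points, six distinct raw values collapse into four categories, which is the kind of reduction that makes a high-cardinality attribute easier to model and interpret.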

Feature selection algorithms
Most real-world data contain more details than what is required to build a model. Such redundant details make extracting the most significant information more difficult [8]. Feature selection is the process of selecting the most important and most relevant features of a dataset [20], [21]. Feature selection enhances the performance of the prediction model, makes the modeling process more efficient, and provides better understanding of the data [9].
There are many feature selection methods available in the literature. In this study's experiments, we used three methods to discover the most influential attributes, namely, correlation-based, information-gain, and relief. While correlation-based feature selection is a greedy search method, the others are rank-based search methods [22]. Using these methods, we identified ten attributes as the most influential features. The selected ten features were used in building the prediction models, while the rest were omitted.
Relief-based feature selection (RFS). This method allocates weights to all dataset features, and these weights are updated iteratively. The most relevant features receive high weights, and the remaining features receive low weights. Relief uses the same nearest-neighbor technique as k-nearest neighbor to assign weights to features [23].
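The weight update behind Relief can be sketched as follows. This is a simplified version of the basic algorithm, under the assumptions that features are numeric and scaled to [0, 1], that every instance is visited once, and that Manhattan distance is used; it is not the exact variant or configuration used in the experiments. The function name and the toy data are hypothetical.

```python
def relief_weights(X, y):
    """Basic Relief sketch: reward features that differ on the nearest
    instance of another class (miss) and penalize features that differ
    on the nearest instance of the same class (hit)."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d

    def distance(a, b):
        # Manhattan distance (an assumption made for this sketch).
        return sum(abs(p - q) for p, q in zip(a, b))

    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda j: distance(X[i], X[j]))
        near_miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            weights[f] -= abs(X[i][f] - X[near_hit][f]) / n
            weights[f] += abs(X[i][f] - X[near_miss][f]) / n
    return weights

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.2], [0.1, 0.9], [0.2, 0.5], [0.8, 0.3], [0.9, 0.8], [1.0, 0.4]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
```

On this toy data the class-separating feature ends up with a positive weight and the noisy feature with a lower one, which is the ranking behavior the method relies on.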
Correlation-based feature selection (CFS). This method ranks subsets of features according to their association with other features and with the class label. Subsets of features that show a strong association with the class label and a low association with each other are assigned a higher value. This method is considered multivariate, as it removes redundant and irrelevant features from the dataset [24].
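The trade-off between relevance and redundancy can be made concrete with the merit heuristic commonly given for CFS, merit = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation, and r_ff the mean feature-feature correlation. The sketch below assumes the correlations have already been computed; the function name and the example values are hypothetical.

```python
from math import sqrt

def cfs_merit(class_corrs, feature_corrs):
    """Merit of a feature subset under the commonly cited CFS heuristic.

    class_corrs:   correlation of each feature in the subset with the class.
    feature_corrs: pairwise correlations between features in the subset
                   (empty when the subset holds a single feature).
    """
    k = len(class_corrs)
    mean_cf = sum(abs(r) for r in class_corrs) / k
    mean_ff = (sum(abs(r) for r in feature_corrs) / len(feature_corrs)
               if feature_corrs else 0.0)
    return k * mean_cf / sqrt(k + k * (k - 1) * mean_ff)

# Two equally relevant features, once weakly and once strongly inter-correlated.
low_redundancy = cfs_merit([0.6, 0.6], [0.1])
high_redundancy = cfs_merit([0.6, 0.6], [0.9])
```

A greedy search can then add, at each step, the feature that increases this merit the most and stop when no addition improves it.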
Information-gain feature selection (IGFS). This method ranks features in descending order of their information gain, that is, the reduction in class entropy obtained by knowing the feature's value. The algorithm specifies a threshold value, and the attributes whose values exceed the threshold require additional processing [25].
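Information gain for a categorical feature can be computed as the class entropy minus the entropy that remains after splitting on the feature. The sketch below uses hypothetical toy data and feature names; it illustrates the ranking principle only and does not reproduce the scores reported in this paper.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the weighted entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: one feature determines the class, the other does not.
labels = ["short", "short", "long", "long"]
features = {
    "informative": ["a", "a", "b", "b"],
    "uninformative": ["a", "b", "a", "b"],
}
ranking = sorted(features,
                 key=lambda name: information_gain(features[name], labels),
                 reverse=True)
```

Sorting features by this score in descending order gives the rank-based selection described above; keeping the top-ranked entries yields the selected subset.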

Experimental setup
In this study, we chose to perform the experiments using the Waikato Environment for Knowledge Analysis (WEKA). WEKA is a commonly used data mining tool that implements most data mining techniques and provides visualization of the results [26]. WEKA offers a powerful and user-friendly visual design environment for creating and testing various feature selection and prediction models.

EXPERIMENTAL RESULTS
We conducted the experiment in three phases. First, we examined the performance of various data mining algorithms, namely naive Bayes, logistic regression, multilayer perceptron, k-nearest neighbor, bagging, J48, and random forest, on the full feature set of the absenteeism-at-work dataset. Second, we used three feature selection algorithms (relief, correlation-based, and information-gain) to select important features. Third, we checked the performance of the classifiers on the selected features. The effectiveness of each classifier was evaluated in terms of accuracy, precision, recall, and time to build the model. We used 10-fold cross-validation because the number of selected features was small [27].
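The three quality metrics can be computed from true and predicted labels as sketched below. The sketch assumes that precision and recall are averaged over classes with weights proportional to class frequency; whether the reported figures use this averaging is an assumption, and the function name and toy labels are hypothetical.

```python
from collections import Counter

def evaluate(y_true, y_pred):
    """Return (accuracy, weighted precision, weighted recall)."""
    n = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predictions per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    precision = recall = 0.0
    for cls, count in support.items():
        cls_precision = true_pos[cls] / predicted[cls] if predicted[cls] else 0.0
        cls_recall = true_pos[cls] / count
        # Weight each class by its share of the true labels (assumed averaging).
        precision += count / n * cls_precision
        recall += count / n * cls_recall
    return accuracy, precision, recall

# Hypothetical toy labels.
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
acc, prec, rec = evaluate(y_true, y_pred)
```

Under this weighting, recall is always equal to accuracy, because the class weights cancel the per-class denominators; precision can differ from both.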

Experiment 1: Comparison of classifiers performance on full features (n=19)
For this experiment, cross-validation was used to examine the performance of the seven data mining classifiers on the full feature set of the dataset, with specified parameter values passed to the classifiers. In Table 3, random forest performs well, with 91% classification accuracy, 88% precision, and 91% recall.

Experiment 2: Comparisons of feature selection methods
Three feature selection methods were chosen in this experiment to select the most relevant features to be used with the classifiers and to compare their performance. We conducted experiments with different numbers of selected features, but in our results we recorded the performance of the classifiers on only 10 features, as we found that their performance on 10 features was very good.

Results of relief-based feature-selection algorithm
The relief-based feature-selection algorithm designates weights to features and selects significant features based on their weights [28]. According to the results, disciplinary failure and reason for absence were the most important features selected by relief for the prediction of absenteeism at work. Figure 1 displays weights assigned to all features using relief and Table 4 shows the selected significant features.

Results of correlation-based feature-selection algorithm
The CFS approach suggests that important features show a strong correlation with the class label and a low correlation with other features, so a high weight should be given to such features [29]. After examining the results, we found that disciplinary failure and reason for absence were the most important features selected by CFS for the prediction of absenteeism at work. Figure 2 displays the weights assigned to all features using CFS and Table 5 shows the selected significant features.
Figure 2. Weights assigned to features using CFS

Results of information-gain feature-selection algorithm
IGFS ranks features in descending order of their information gain. When examining the results, we found that reason for absence and disciplinary failure were the most important features selected by IGFS for the prediction of absenteeism at work. Figure 3 displays the weights assigned to all features using IGFS and Table 6 shows the selected significant features. Tables 4-6 show the significant features selected by the three feature-selection algorithms for the prediction of absenteeism at work. The first features selected by CFS have high scores, which means that CFS features have a strong influence on the prediction of absenteeism at work.

Experiment 3: Comparison of classifiers' performance on selected features (n=10)
In this experiment, we applied seven classification methods to the 10 attributes selected from the main dataset. We repeated the experiment with each feature-selection method. Once a predictive model was built, its results were recorded; they are reported in Tables 7-9. The results generated with classifiers optimized by the CFS method were the best reported results. The bagging model showed the best results compared to the other classifier algorithms, with an accuracy of 92%, precision of 90%, and recall of 92%. Logistic regression, k-nearest neighbor, and J48 classifiers achieved competitive results, with accuracies of 91% when features were selected using the CFS method. Although random forest with relief-based and information-gain feature selection achieved a high accuracy of 90%, this is still lower than the accuracy of the same classifier on the full dataset. This is because random forest handles high-dimensional data well and provides estimates of the important variables in the classification [30]. Training neural network models (multilayer perceptron) takes more time than training other data mining models [31]. Generally, the main issues related to inductive learning are the time required to build the model and the accuracy of classifying new examples [32].

DISCUSSION
In this paper, we applied data mining algorithms to the absenteeism-at-work dataset to predict absenteeism hours based on employees' attributes. Our aim was to compare various classification models using several feature-selection methods and to identify the most effective model. From the tables above, we note that different algorithms performed best depending on which feature-selection method was used. Each algorithm is capable of outperforming another depending on the situation. For example, random forest performs better with a large number of features than with a subset of features, while bagging performs better with a selected subset of features. In general, the efficiency of the algorithms increased when feature-selection methods were applied, and the correlation-based feature-selection method gave the best results. This demonstrates the need for feature selection before a classifier is applied to the data. After applying the feature-selection methods, we used performance metrics to compare the different data mining algorithms, since this is a standard process in evaluating algorithms. Without feature selection, the best average precision value was from k-nearest neighbor with 89%. After optimizing with the correlation-based feature-selection method, we found that the best precision was from bagging and k-nearest neighbor with 90%. Finally, when we compared the accuracy of the various classification algorithms with the feature-selection methods, we found that the best one was bagging with 92% accuracy.

CONCLUSION
Absenteeism at work is perceived as one of the most significant problems for organizations, as it may raise their expenses and pose a barrier to the accomplishment of organizational goals and priorities. It is important to build and incorporate methods for predicting absenteeism at work in organizations to enable management to take action against the shortage of employees and reduce financial costs. The goal of this study was to identify important features and the best-performing classification methods that enhance the accuracy of absenteeism-at-work prediction. We took advantage of feature-selection methods with the goal of enhancing the quality of absenteeism prediction. An experiment was first conducted to find the high-influence attributes. Then, classification was performed using different classification algorithms, namely naive Bayes, logistic regression, multilayer perceptron, k-nearest neighbor, bagging, J48, and random forest. The experimental results reconfirmed the significance of the selected features. Additionally, among the seven methods, bagging outperformed the others with 92% accuracy when features were selected using CFS.
There are many ways to enhance this research and address the limitations of this study. The same experiment could be carried out on a broad scale with real-world datasets to expand this work and generalize the findings. In addition, the performance of other data mining techniques in absenteeism prediction could be tested. Furthermore, new methods for selecting features may be used to obtain a broader perspective on the critical features used to enhance prediction accuracy.