Modeling of agarwood oil compounds based on linear regression and ANN for oil quality classification

ABSTRACT


INTRODUCTION
Essential oils are highly concentrated compounds that produce aromatic smells. They are extracted from plants as secondary metabolites [1], [2]. Agarwood oil is a well-known essential oil extracted from agarwood trees such as Aquilaria malaccensis, which belong to the genus Aquilaria of the plant family Thymelaeaceae [3]. Agarwood is resin-impregnated heartwood. Every part of the agarwood plant has its own use [4]. For example, the tree trunks and branches can produce quality powder, and the essential oil is produced from the stems [4]. Agarwood is used in incense, perfumery, medicine and ceremonies by numerous cultures and religions [3], [5], [6]. To obtain good-quality agarwood oil, the oil must go through a grading process. Previously, agarwood oil has been graded according to its physical properties, such as color intensity, density and odor, as well as by human sensory panels. However, subjectivity, poor reproducibility and time consumption make this technique inefficient [3], [7], [8]. In addition, handling a large number of samples during continuous production can fatigue the human nose [3].
Essential oil quality classification can be performed based on chemical profiles, so that essential oils can be accurately assigned to their respective classes (high or low grade). Researchers now classify agarwood oil using machine learning techniques such as artificial neural network (ANN), support vector machine (SVM) and multilayer perceptron (MLP), along with various analysis techniques [7], [9]-[13]. An existing study applied the ANN technique to grading agarwood oil [10]. The researchers built two ANN models, the first with 16 samples of data and the second with 90 samples (16 original samples x 5 repetitions of synthetic data). Their objective was to observe the performance of ANN on a larger sample set, and the results showed that the ANN trained with more samples achieved higher accuracy [10]. Next, SVM was used to grade agarwood oil with two kernels, multilayer perceptron (MLP) and radial basis function (RBF) [14]. The scaled conjugate gradient (SCG) algorithm for MLP training has also been successfully applied in grading agarwood oil [7]. The sample data were fed into the MLP, which was modelled using SCG with 10 hidden neurons and a data division of 70%:15%:15% for training, validation and testing. Accuracy and MSE were used to assess the SCG performance. Furthermore, another study examined agarwood oil using ANN and analyzed its chemical compounds using GC-MS and SPME [15]. That study found that a few compounds deserve attention as markers of high-grade agarwood oil and can serve as a future benchmark in classifying agarwood oil. The compounds are aromadendran, 10-epi-eudesmol, β-agarofuran, α-agarofuran and β-eudesmol [15].
Currently, the MLP network is widely used by researchers in many areas of study, such as essential oil grading, pattern recognition, the oil industry and many others [16]-[23]. MLP has been successfully implemented to estimate the extraction yield of Oregano and Valerian essential oils [16], [17]. The study in [17] used the SCG, RBP, LM and gradient descent training algorithms, while the study in [16] used only the LM algorithm. In both studies, the LM algorithm achieved good accuracy with minimum error. The study in [18] successfully implemented MLP to detect infant hypothyroidism from cry signals whose features were extracted using MFCC analysis. The number of hidden neurons and the number of coefficients were varied in that experiment, with scaled conjugate gradient (SCG) as the training algorithm. The lowest MSE and highest accuracy were obtained with 15 hidden neurons and 20 feature coefficients [18]. In [19] and [20], the MLP network was used for pattern recognition: recognizing 18 leaf images by classifying defected and un-defected leaves, and classifying resumes according to the job specification through word recognition.
In this research study, a new engineering approach is proposed that uses stepwise linear regression as a statistical analysis technique to model the relationship between the input and output of agarwood oil. The MLP is implemented with three training algorithms, resilient backpropagation (RBP), scaled conjugate gradient (SCG) and Levenberg-Marquardt (LM), as classifiers to classify agarwood oil into high and low quality. The performance of each training algorithm is compared using a confusion matrix and mean square error (MSE). The objective of this research study is to identify the compounds that are the major factors in the quality of agarwood oil using stepwise regression, and then to validate the selected compounds by classification using the MLP training algorithms.

Stepwise regression
First, the method started with stepwise regression. Stepwise multivariable regression was selected because more than one independent variable (agarwood oil compound) was used in this research study. The stepwise regression procedure is shown in Figure 1.
Step 1: Design variables. The sample data were loaded into MATLAB version R2017a. Then, the data were divided into independent (X) and dependent (Y) variables. The X variables were X1=β-agarofuran, X2=α-agarofuran, X3=10-epi-γ-eudesmol, X4=γ-Eudesmol, X5=Longifolol, X6=Hexadecanol and X7=Eudesmol, corresponding to the seven selected significant compounds used as inputs. The Y variable was the agarwood oil quality, which consists of high and low quality.
Step 2: Train the dataset. The data were then fitted by stepwise regression, which includes the forward selection and backward elimination methods. In this experiment, the significance level α was set to 0.05 (5%), as recommended by the existing study in [24].
Step 3: Forward selection. Starting with an empty model, the stepwise procedure calculates the p-value of each X variable. The p-values were observed step by step: if the p-value was less than 0.05, the X variable was added to the model; otherwise, it was not added, and forward selection continued searching among the X variables not yet in the model until all significant variables were included.
Step 4: Backward elimination. Next, backward elimination observed the p-values of the X variables already in the model. An X variable was removed from the model if its p-value was more than 0.05; otherwise, it was retained. The process of adding and removing X variables was repeated automatically by MATLAB until all X variables with the strongest effects on agarwood oil quality were found.
Step 5: Statistical analysis and interpretation of the output. The stepwise multiple regression model in this research study can be represented as Y = m1X1 + m2X2 + m3X3 + m4X4 + m5X5 + m6X6 + m7X7 + C, where Y is the dependent variable (quality of agarwood oil), X1 to X7 are the independent variables of agarwood oil, m1 to m7 are the coefficients of the independent variables, and C is the constant (intercept) term [25], [26]. Next, the results of stepwise regression for this experiment were generated by MATLAB in terms of several parameters: the root mean squared error (RMSE), the coefficient of determination R², the adjusted coefficient of determination (adjusted R²), the F-statistic and the overall p-value of the model. The percentage of variability in the dependent variable explained by the model was determined by R² and adjusted R², with the equations shown in (1) and (2). The RMSE was the estimate of the standard deviation of the error distribution and was calculated using (3). SSE was the sum of squared errors, SSR was the regression sum of squares and SST was the total sum of squares.
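The forward selection and backward elimination loop described in Steps 3 and 4 can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB routine used in the study, and it rests on several assumptions that are not in the original: the helper names (invert, ols_p_values, stepwise) are hypothetical, the p-values use a large-sample normal approximation instead of the exact t distribution, one variable is added and one removed per pass, and the data are synthetic.

```python
import math
import random
from statistics import NormalDist


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    k = len(a)
    m = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(a)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        p = m[c][c]
        m[c] = [v / p for v in m[c]]
        for r in range(k):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[k:] for row in m]


def ols_p_values(X, y):
    """Least-squares fit; returns one two-sided p-value per column of X.

    Assumption: normal approximation to the t distribution (large n).
    """
    n, k = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    sse = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    sigma2 = sse / (n - k)
    z = [abs(b) / math.sqrt(sigma2 * inv[i][i]) for i, b in enumerate(beta)]
    return [2.0 * (1.0 - NormalDist().cdf(v)) for v in z]


def stepwise(cols, y, alpha=0.05):
    """cols maps a variable name to its list of values; returns kept names."""
    def p_values(names):
        X = [[1.0] + [cols[nm][r] for nm in names] for r in range(len(y))]
        return dict(zip(names, ols_p_values(X, y)[1:]))  # skip the constant

    selected = []
    for _ in range(4 * len(cols)):  # bounded number of passes
        changed = False
        # Forward selection: add the best candidate if its p-value < alpha.
        candidates = {nm: p_values(selected + [nm])[nm]
                      for nm in cols if nm not in selected}
        if candidates:
            best = min(candidates, key=candidates.get)
            if candidates[best] < alpha:
                selected.append(best)
                changed = True
        # Backward elimination: drop the worst variable if its p-value > alpha.
        if selected:
            current = p_values(selected)
            worst = max(current, key=current.get)
            if current[worst] > alpha:
                selected.remove(worst)
                changed = True
        if not changed:
            break
    return selected


# Synthetic data (not the agarwood data): y depends on x1 and x3 only.
rng = random.Random(0)
n = 100
cols = {name: [rng.gauss(0, 1) for _ in range(n)] for name in ("x1", "x2", "x3")}
y = [1.0 + 2.0 * a - 3.0 * c + rng.gauss(0, 0.5)
     for a, c in zip(cols["x1"], cols["x3"])]
selected = stepwise(cols, y)
print(selected)
```

On the synthetic data, the two variables that actually drive y are expected to be retained, while the pure-noise variable x2 is kept only if it happens to pass the 5% threshold by chance.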

ANN modeling
As shown in Figure 2, the process continued from stepwise regression. First, the output features from stepwise regression were set as the input features of the MLP network classification for validation. The agarwood oil data were modelled in the MLP using the pattern recognition network (patternnet) function in MATLAB R2017a. Next, the data were pre-processed through data normalization, data randomization and data division. The data were divided into three groups, training, validation and testing, with ratios of 70%:15%:15% respectively. The dataset was then trained and classified using the MLP training algorithms. At this stage, three training algorithms were used: resilient backpropagation (RBP), scaled conjugate gradient (SCG) and Levenberg-Marquardt (LM). The number of hidden neurons was set to 10 neurons in a single hidden layer. The activation functions used in the MLP network model were the sigmoid function for the hidden layer and the log-sigmoid function for the output layer. After training, the dataset was validated and tested. The classification performance was evaluated using a confusion matrix, from which the accuracy, sensitivity, specificity and precision were derived. Additionally, the training was repeated in order to obtain the desired mean squared error (MSE) value for each number of hidden neurons and each training algorithm. Table 1 shows the confusion matrix used in the MLP modeling to describe the classification performance of agarwood oil. In Table 1, ep is the number of samples accurately classified to the low category; en is the number accurately classified to the high category; sp is the number inaccurately classified to the high category; and sn is the number inaccurately classified to the low category.
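The pre-processing stage (normalization, randomization and 70%:15%:15% division) can be illustrated with a short Python sketch. The function names, the min-max scaling to [0, 1], the rounding rule for the split sizes and the toy data are all assumptions made for illustration; the study itself relied on MATLAB, whose defaults may differ.

```python
import random


def min_max_normalize(rows):
    """Scale every column of a list of feature rows to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def split_70_15_15(samples, seed=0):
    """Shuffle the samples, then divide them into train/validation/test."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # data randomization
    n = len(shuffled)
    n_train = round(0.70 * n)
    n_val = round(0.15 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to testing
    return train, val, test


# Toy data: 20 samples with two features each (not the agarwood data).
rows = [[float(i), float(2 * i + 1)] for i in range(20)]
scaled = min_max_normalize(rows)
train, val, test = split_70_15_15(scaled, seed=0)
print(len(train), len(val), len(test))
```

Fixing the shuffle seed keeps the division reproducible, which makes it easier to compare training algorithms on identical subsets.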
The equations for each performance criterion are provided below [27]. The accuracy, ACRY, as in (4), is defined as the overall effectiveness of a classifier:

$$\mathrm{ACRY} = \frac{ep + en}{ep + en + sp + sn} \times 100 \tag{4}$$

The capability of a classifier to identify the low quality group is defined as the sensitivity, STVY, as in (5):

$$\mathrm{STVY} = \frac{ep}{ep + sp} \times 100 \tag{5}$$

The specificity, SPCT, is how effectively the classifier identifies the negative labels (the high quality group):

$$\mathrm{SPCT} = \frac{en}{en + sn} \times 100 \tag{6}$$

The precision, PRCS, is the class agreement of the data labels with the positive labels given by the classifier:

$$\mathrm{PRCS} = \frac{ep}{ep + sn} \times 100 \tag{7}$$

Table 2 shows the four compounds selected by stepwise regression that are significant to the model. The compounds are X1=β-agarofuran, X4=γ-Eudesmol, X5=Longifolol and X7=Eudesmol. All of these independent variables were selected by forward selection because their p-values were below the 5% significance level. Only these four compounds are major factors in agarwood oil quality. MATLAB then generated the estimated coefficients for predicting agarwood oil quality, which are tabulated in Table 3. From the estimated coefficients, a linear model was developed using the following stepwise multivariable equation:
Y = m1X1 + m4X4 + m5X5 + m7X7 + C
Y = 1.7337 + 0.031742 X1 + 0.021556 X4 + 0.10766 X5 - 0.25592 X7
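To show how the fitted linear model would be evaluated for a single sample, the following Python sketch applies the reported coefficients. The function name, the dictionary layout and the example abundance values are hypothetical; the units of the compound abundances and the numeric coding of high and low quality are not stated here, so the output should be read only as the value of Y for the given inputs.

```python
# Coefficients and constant term as reported for the fitted model.
COEFFICIENTS = {
    "X1": 0.031742,   # beta-agarofuran
    "X4": 0.021556,   # gamma-eudesmol
    "X5": 0.10766,    # longifolol
    "X7": -0.25592,   # eudesmol
}
CONSTANT = 1.7337


def predict_quality_score(abundances):
    """Return Y for one sample given the abundance of each selected compound."""
    return CONSTANT + sum(
        coefficient * abundances[name]
        for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical sample values, chosen only to exercise the function.
sample = {"X1": 1.0, "X4": 0.0, "X5": 0.0, "X7": 0.0}
score = predict_quality_score(sample)
print(score)
```

The negative coefficient on X7 means that, with the other compounds held fixed, a higher Eudesmol abundance lowers the predicted Y.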

Stepwise regression
The output of stepwise regression is summarized in Table 4. The RMSE value was 0.0224, and the values of R² and adjusted R² were 68.7% and 67.4% respectively. The F-statistic, which tests the significance of the relationship between the response and the predictor variables, was 50, and the overall p-value of the F-test was 3.4 × 10^-22.
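The statistics reported above are related through the usual least-squares definitions. The block below is a sketch of those definitions and not necessarily the exact form used by the authors: n (the number of samples) and p (the number of estimated coefficients, including the constant term) are symbols introduced here for illustration, and the use of n - p in the RMSE is an assumption.

```latex
\begin{align}
R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST}, \\
R^2_{\mathrm{adj}} &= 1 - \frac{n - 1}{n - p}\,\frac{SSE}{SST}, \\
\mathrm{RMSE} &= \sqrt{\frac{SSE}{n - p}}, \\
F &= \frac{SSR / (p - 1)}{SSE / (n - p)} .
\end{align}
```

Because the adjusted R² penalizes each additional coefficient, it is always slightly below R² for a model with more than a constant term, which matches the ordering of the two reported values (67.4% versus 68.7%).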

MLP network
The four compounds selected by the stepwise method were fed into the MLP network model as the input features. The compounds were X1=β-agarofuran, X4=γ-Eudesmol, X5=Longifolol and X7=Eudesmol. The accuracy and MSE results for each training algorithm are tabulated in Tables 5-7. The training accuracy across all algorithms varied from 86.8% to 100%, the validation accuracy from 92.9% to 100% and the testing accuracy from 85.7% to 100%.
Int J Elec & Comp Eng ISSN: 2088-8708
From Tables 5-7, one hidden neuron was chosen in each algorithm as the best setting for classifying agarwood oil into high and low quality categories. One hidden neuron was chosen because its accuracy exceeded 80% with a minimum MSE value at the early stage of training. Also, one hidden neuron is sufficient to avoid long computational time and overfitting problems [17], [18]. The MSE values for each number of hidden neurons had a similar average across all training algorithms. SCG obtained good accuracy at the early stage of training, with more than 90% for training and 100% for validation and testing, and an MSE of 0.0446. The accuracy results obtained by the LM algorithm were the same as those of SCG in all datasets, with an MSE of 0.0384. The accuracy of RBP was greater than 80% for training and 100% for validation and testing, with an MSE of 0.0468. In Figure 3, the graph shows that the MSE values at one hidden neuron for SCG, LM and RBP were the highest, at 0.0446, 0.0384 and 0.0468 respectively. The high MSE value at the initial stage of training is due to the initial adjustment of the weights, which produces outputs that differ significantly from the actual data [9].
However, increasing the number of hidden neurons decreased the MSE value as the network weights became more stable during training [9]. Table 8 summarizes the final design parameters for one hidden neuron in each training algorithm. Comparing the MSE values at one hidden neuron for SCG, LM and RBP, the LM algorithm outperformed the others: its MSE of 0.0384 was the lowest, compared with 0.0446 for SCG and 0.0468 for RBP. LM was therefore the best algorithm for agarwood oil classification in this study. This suggests that LM was the most capable of following the descent direction towards the lowest sum of squared errors. These findings indicate that SCG was fast in terms of computational time but gave a greater error than LM. Table 9 shows the confusion matrix of the training dataset for one hidden neuron with the LM algorithm. Of the sixty-eight training samples, the predicted class placed seventeen in the low quality class and fifty-one in the high quality class. Table 10 shows the training classification performance of the LM algorithm derived from the confusion matrix. The accuracy was 92.6%, the sensitivity was 76.5%, and the specificity and precision were 98% and 92.9% respectively.
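The four performance measures can be checked with a few lines of Python. The sketch below uses the standard confusion-matrix definitions with the low quality class treated as the positive class. The cell counts (13, 4, 1 and 50) are hypothetical: they are not taken from Table 9, but were chosen because, reading seventeen and fifty-one as the sizes of the low and high classes (an interpretation), they reproduce the reported percentages after rounding.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return accuracy, sensitivity, specificity and precision in percent.

    tp: low samples classified as low     fn: low samples classified as high
    fp: high samples classified as low    tn: high samples classified as high
    """
    total = tp + fn + fp + tn
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "precision": 100.0 * tp / (tp + fp),
    }


# Hypothetical counts for 68 training samples (17 low, 51 high).
metrics = confusion_metrics(tp=13, fn=4, fp=1, tn=50)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

With these counts, the gap between sensitivity and specificity reflects the class imbalance: a handful of misclassified low quality samples lowers the sensitivity far more than a single misclassified high quality sample lowers the specificity.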

CONCLUSION
This study presents two models, stepwise regression and multilayer perceptron, for use in the classification of agarwood oil quality. Four independent variables (compounds) were selected as the factors affecting agarwood oil quality. The compounds were β-agarofuran, γ-Eudesmol, Longifolol and Eudesmol. A linear relationship for agarwood oil was obtained as Y = 0.031742 X1 + 0.021556 X4 + 0.10766 X5 - 0.25592 X7 + 1.7337, with R² and adjusted R² values of 68.7% and 67.4%. The three MLP training algorithms were successfully implemented as classifiers to classify agarwood oil into high and low quality. The network architecture (4-1-1), that is, 4 neurons in the input layer, 1 hidden neuron in a single hidden layer and 1 neuron in the output layer, was found to be the most suitable for modeling agarwood oil in this study. The MLP with the LM algorithm using one hidden neuron outperformed the others in terms of accuracy and mean square error (MSE). This technique is expected to benefit the agarwood oil industry, especially its grading system.