Transformations for non-destructive evaluation of brix in mango by reflectance spectroscopy and machine learning

ABSTRACT


INTRODUCTION
Mango is a highly nutritious fruit and one of the most widely consumed in the world; it contains proteins, carbohydrates, fats, minerals and vitamins [1], [2]. World exports of mango, mangosteen and guava have exceeded 2.3 million tons; Mexico, Thailand, Brazil, Peru and India are the largest exporters, with Europe as the main export destination [3]. Peruvian mangoes stand out for their high quality, which is why they are increasingly in demand. The mango varieties cultivated in Peru are Haden, Edward, Keitt and Kent. The north of Peru, specifically the city of Piura, is the largest producer of this fruit, followed by Lambayeque and Ancash, and the country exported more than 240,000 tons of mango during the 2021-2022 campaign [4].
In many countries, traditional methods for estimating nutrients in mango plants are applied while the mango is still growing; these techniques are invasive and time consuming. In addition, the analytical methods used to determine total soluble solids (TSS) are destructive and are not suitable for remotely monitoring fruit quality in modern grading lines [5]. Near infrared (NIR) spectroscopy is a more attractive alternative for measuring characteristics that correlate with fruit maturity: it allows high-speed data acquisition and, with the help of calibration techniques, correct prediction can be achieved [5], [6]. For this reason, the present study focuses on using NIR spectroscopy to predict TSS, or °Brix, as an indicator of the degree of maturity of Edward mangoes using machine learning techniques. The aim is to reduce the time needed to determine the internal characteristics of the fruit without destroying it or altering its macro- and micronutrients. Our research focuses on determining the transformations, pre-processing technique, input features and dimensionality reduction algorithm best suited to building these models.
First, we review studies that highlight the importance of NIR spectroscopy for measuring fruit quality indicators. One work evaluated the ability of this technology to determine chemical composition, such as TSS and titratable acidity (TA), in 3 different fruits. It reported encouraging results and concluded that the technology is well suited to predicting the internal content of intact fruits. However, it also noted that in species such as passion fruit and tomato the results may be affected by the physical structure of the fruit [7].
In addition, Fourier transform NIR spectroscopy has been presented as a viable option for the rapid localization of diseases in fruit [8]. Similarly, a portable spectrometer has been used to monitor the firmness of the Kent mango non-destructively, evaluating its physical and chemical properties during ripening; the results demonstrate a relationship between changes in fruit firmness and its spectral signature [9]. Other examples include the measurement of soluble solids content (SSC), TA, vitamin C and surface color of mandarins [10]; dry matter (DM) content, potential of hydrogen (pH), TSS and acid-Brix ratio (ABR) of bananas [11]; and firmness, starch index and total soluble solids content of apples [12]. Near infrared spectroscopy is therefore a fast, non-destructive method for assessing fruit quality that saves labor hours and improves grading accuracy. Being able to study changes in these parameters makes it possible to produce quality fruit and harvest it at the right time, either for immediate consumption or to withstand export time [13].
Regarding the spectral range, in [7] a multipurpose analyzer spectrometer was used in reflectance mode over a range of 800 to 2,700 nm with a spectral resolution of 2 nm. The general shapes of the spectra of tomato, apricot and passion fruit were quite similar, although the spectrum of passion fruit showed a slight shift; this similarity is mainly because these fruits contain between 80% and 90% water. In [14], spectra were acquired over a range of 300 to 1,000 nm. The analysis showed that the curves of green and ripe mango fruits differed in the range of 550 to 700 nm, while above 750 nm a relatively flat region was observed with different reflectance values for the two types. The best model for predicting DM, TSS and maturity was obtained in the NIR range of 700 to 990 nm, without applying any processing technique to the spectral data. Another study [13] pointed out that using the entire spectral range is detrimental to the calculation, since the data are usually large, and that the purchase cost of spectral reading devices increases proportionally to the total range. For these reasons, the most relevant wavelengths were analyzed with an artificial neural network-simulated annealing algorithm (ANN-SA). That work estimated the TSS and BrimA of the Gala apple and determined that the most effective wavelengths are 953, 961, 977 and 983 nm for TSS and 958, 966, 972 and 984 nm for BrimA [13]. Likewise, a 2020 study used an algorithm based on an artificial neural network-differential evolution (ANN-DE) to determine the effective wavelengths for 3 properties (firmness, acidity and starch content) of the Fuji apple. It showed that in the range of 841 to 882 nm each property is predicted with a higher regression coefficient than with the full wavelength range of 400 to 1,000 nm, which demonstrates that it is possible to develop a lower cost portable device that works in the optimal range and is equally efficient [15]. In [9], a portable visible near infrared spectrometer was used in the 310 to 1,130 nm range at a spectral resolution of 8 to 13 nm; measurements were taken at 2 different points per sample in order to predict the firmness of Kent mangoes at different stages of ripening. In that work, the interval partial least squares regression (iPLSR) algorithm was used to select the most relevant spectral regions. The algorithm selected the ranges of 743 to 770 nm and 870 to 905 nm, which contribute to the prediction of firmness; compared with using the whole spectrum, there is a 14% improvement in the prediction error.
Methods based on the analysis of the spectral signature of different fruits have also been presented. A partial least squares regression (PLSR) modeling study was carried out in Brazil, using NIR spectroscopy to estimate firmness and TSS concentration in peach fruits. Spectra were obtained as log(1/R), and model performance was evaluated with the root mean square error (RMSE) and the coefficient of determination (R²). Principal component analysis (PCA) failed to group fruit according to blush and skin background color, maturity stage and harvest season. However, NIR spectroscopy with partial least squares (PLS) proved to be a potential analytical method for determining TSS and firmness of the cultivar 'Aurora 1' [5]. In another paper, the TA, TSS and pulp content (PC) of passion fruit were predicted by NIR spectroscopy using a PLSR model. Encouraging results were obtained for TA and PC, with correlation coefficients of 0.91 and 0.99 respectively; for TSS, however, only a value of 0.84 was obtained [16]. At Cebu University of Technology, Philippines, a study was conducted to predict DM, TSS and mango maturity using PLSR analysis. The calibration model built on NIR spectra predicted DM with R² = 0.774, TSS with R² = 0.774 and maturity with R² = 0.946, values that can serve as a basis for quality control and for an automatic system that classifies the product according to its fruit characteristics [14]. In another study, visible near-infrared spectroscopy (VNIRS) was used for the non-destructive prediction of mango firmness during ripening, comparing a standard PLSR that uses the entire spectral range against an improved PLSR that uses the variables selected by iPLSR. The improved model obtained R² = 0.75, compared with R² = 0.67 for the standard PLSR model, an increase of 12% in R² [9].
The paragraphs above present several investigations that describe encouraging results in the use of spectroscopic signals with PLSR and principal component regression (PCR) algorithms for the estimation of internal parameters of fruit. Some of these studies use reflectance (R) as the model input variable [9], [17]-[19]. Others, however, must resort to preprocessing techniques to obtain good results. A common transformation is the logarithm of the reciprocal reflectance, log(1/R). It has been applied to mango to estimate softening of the flesh, total soluble solids content and acidity [19], and it has also been used to estimate internal properties of apple [20], bayberry [21], [22], pear [23]-[25], orange [26], bell pepper [18], strawberry [27], kiwifruit [28], low chilling peach [5], passion fruit, tomato and apricot [7]. Another transformation is the first derivative of the reflectance, dR, used for mango [19] and for bell pepper [17], [18]. Other works have reported results with the first derivative of the logarithm of the reciprocal reflectance, dlog(1/R), applied to mango [19], apple [20], pear [23], [25], bell pepper [17], [18], avocado [29], passion fruit, tomato and apricot [7] and mandarin fruit [10].
The spectral pretreatments known as standard normal variate (SNV) and multiplicative scatter correction (MSC) often give very similar results and are considered interchangeable [30]. Both techniques have already been applied in mango spectroscopy: SNV and MSC were used to estimate the TSS of the 'Nam Dok Mai Sithong' mango [31], while only MSC was used to estimate firmness in the 'Kent' mango [9]. Other investigations have used both techniques with PLSR and/or PCR and have shown encouraging results in fruits. For example, MSC and SNV have been used in orange fruit to estimate TSS, TA and BrimA [32], [33], in loquat to estimate TSS and acidity [34], in Symplocos paniculata to estimate oil content and acids [35], and in apples to detect bruise damage. Finally, some investigations have considered only MSC, for mandarin fruit [10] and banana [11].
As described above, different pre-processing techniques have been employed in NIR applications with PLSR and PCR. However, it has not been clear which of them is the most suitable. Therefore, the objective of this study is to determine the most appropriate preprocessing techniques to estimate the TSS of Edward mango by applying NIR spectroscopy with machine learning techniques such as PLSR and PCR. To that end, the following sections analyze a total of 18 PCR models and 18 PLSR models that combine 4 types of transformations, 3 different feature extractors and 3 different pre-processing techniques.

METHOD
Figure 1 shows the methodology implemented. It is divided into 4 stages: spectral data collection and °Brix measurement, a pre-processing stage applied to the raw data, a double cross validation (CV) stage and, finally, a testing phase to obtain the model metrics.

Getting raw data

Spectral signature capture
Spectral signatures were recorded with an AvaSpec-NIR256-1.7-EVO NIR spectrometer in reflectance mode. The spectrometer has an operating spectral range of 900 to 1,750 nm with 221 bands. Eighteen Edward variety mangoes were sampled, and 12 measurement points were identified on each of them, for a total of 216 samples. The location of the measurement points is determined by the measurement protocol detailed in Figure 2. First, the mango is placed with the dorsal shoulder to the right, and 3 circles are marked at different levels on the cheek of the mango as shown in Figure 2(a). The mango is then turned 90 degrees and 3 new areas are marked at different levels as shown in Figure 2(b). The procedure is repeated as shown in Figures 2(c) and 2(d). Figure 3 shows the raw spectral data taken in reflectance mode.

Brix measurement
TSS was measured with a hand-held 0-32 °Brix refractometer with automatic temperature compensation (ATC). The objects under study are the 18 Edward variety mangoes whose spectral signatures were previously recorded. The procedure consisted of quantifying the °Brix of a mango juice sample for each of the 12 points. First, the flesh of the fruit is exposed with the aid of a fruit peeler. Secondly, the

Pre-processing
A flow chart of the pre-processing stage is detailed in Figure 5. First, the dependent variable went through a transformation substage (logarithmic, square root, power of two or no transformation). The independent variables (Figure 3), on the other hand, went through a filtering substage in which a Savitzky-Golay filter with a 7-point window and a second-degree polynomial was applied. The Savitzky-Golay method is the most used smoothing and differentiation technique in chemometrics [36]-[38]; it consists of a local polynomial regression that requires equidistant spectrum values, as shown in (1).
$$x_j^{*} = \frac{1}{N}\sum_{h=-k}^{k} c_h\, x_{j+h} \qquad (1)$$
where $x_j^{*}$ is the value of the curve to be smoothed or derived, $N$ is the number of points in the window, $k$ is the number of neighboring points per side, and $c_h$ are the coefficients that depend on the degree of the regression polynomial and the objective (smoothed or derived). The filter outcomes are shown in Figure 6. The filtered independent variable went through a transformation substage as shown in Figure 7, after which one of the following results is obtained: no transformation (R), Figure 7 [39], [40]. Unlike the spectral derivatives, MSC preserves the spectral shape. With $x_i$ being the vector of spectrum $i$ in a set of $n$ samples, the MSC model for a spectrum is shown in (2).
$$x_i = a_i \mathbf{1} + b_i \bar{x} + e_i \qquad (2)$$
where $\mathbf{1}$ is a column vector of ones of the same length as $x_i$, $\bar{x}$ is the reference spectrum (usually the mean spectrum of the set) and $e_i$ is the residual. The parameters $a_i$ and $b_i$ are estimated for each spectrum $x_i$. Thus, each spectrum is corrected as (3).
$$x_{ik}^{*} = \frac{x_{ik} - a_i}{b_i} \qquad (3)$$
where $x_{ik}$ is the value of absorbance $k$ of spectrum $i$.
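The filtering and scatter-correction steps of this stage, together with SNV as defined in (4), can be sketched in plain Python. This is a minimal illustration and not the code used in the study: the function names are hypothetical, the coefficients are the standard tabulated Savitzky-Golay values for a 7-point window and a quadratic polynomial (already normalized), the 3 points at each edge are simply dropped, log(1/R) is assumed to use base 10, and SNV is assumed to use the sample standard deviation.

```python
import math
from statistics import mean, stdev

# Tabulated Savitzky-Golay coefficients, 7-point window, quadratic polynomial.
# The normalizing factors (21 and 28) are already folded into the coefficients.
SG_SMOOTH = [c / 21 for c in (-2, 3, 6, 7, 6, 3, -2)]
SG_DERIV1 = [c / 28 for c in (-3, -2, -1, 0, 1, 2, 3)]  # per band step


def savgol7(x, coeffs=SG_SMOOTH):
    """Apply a 7-point Savitzky-Golay filter as in (1); edge points are dropped."""
    k = len(coeffs) // 2
    return [
        sum(c * x[j + h] for h, c in zip(range(-k, k + 1), coeffs))
        for j in range(k, len(x) - k)
    ]


def log_inv(reflectance):
    """log(1/R) transform (base 10 assumed); reflectance must be positive."""
    return [math.log10(1.0 / r) for r in reflectance]


def msc(spectra):
    """MSC as in (2)-(3): regress each spectrum on the mean spectrum."""
    ref = [mean(col) for col in zip(*spectra)]
    ref_mean = mean(ref)
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for x in spectra:
        x_mean = mean(x)
        b = sum((r - ref_mean) * (v - x_mean) for r, v in zip(ref, x)) / sxx
        a = x_mean - b * ref_mean
        corrected.append([(v - a) / b for v in x])
    return corrected


def snv(spectra):
    """SNV as in (4): center and scale each spectrum by its own statistics."""
    out = []
    for x in spectra:
        m, s = mean(x), stdev(x)
        out.append([(v - m) / s for v in x])
    return out


# Toy data (made up): two spectra that differ only by offset and slope.
base = [0.40, 0.42, 0.47, 0.55, 0.60, 0.58, 0.52, 0.45, 0.41, 0.40]
toy = [[0.05 + 1.2 * v for v in base], [-0.02 + 0.8 * v for v in base]]
smoothed = [savgol7(s) for s in toy]
after_msc = msc(smoothed)
after_snv = snv(smoothed)
```

In practice a library routine such as `scipy.signal.savgol_filter` would normally replace the hand-written filter; the explicit coefficients are used here only to make the link to (1) visible.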
On the other hand, SNV transforms the absorbance values of each spectrum separately, as in (4):
$$x_{ik}^{*} = \frac{x_{ik} - \bar{x}_i}{s_i} \qquad (4)$$
where $\bar{x}_i$ and $s_i$ are the mean and standard deviation of the absorbances $x_{ik}$ in spectrum $i$. Thus, the transformed absorbances have zero mean and unit standard deviation in each spectrum. An example can be seen in Figure 8, where SNV is applied to the transformation R.
Finally, both the independent variables and the dependent variable went through a scaling stage. Here, the mean was removed and the standard deviation was scaled to 1 for each wavelength. As an example, Figure 9 shows the result of applying the scaling to Figure 8(a).
PCA is especially useful for data sets with correlated variables such as spectral data. However, the technique is sensitive to outliers and scaling. Partial least squares (PLS) is a technique for relating the matrix $X$ to the vector $y$ [42]-[44]. A crucial point in building a predictive model with PCR or PLSR is deciding the number of components to use. In principle, the independent variables can first pass through a variable selection stage; for simplicity, however, this project is limited to finding the optimal number of components. In the internal CV, the calibration set was divided into training and validation sets. The 12 samples from each mango were grouped, shuffled and finally assigned to the same set to avoid falsely optimistic results. The optimal number of principal components was then selected with the one-standard-error rule [45]. Finally, the lists of 50 candidates for the optimal number of components were analyzed using a histogram. The final number of components is the mode of the candidates.
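The component-selection logic described above can be sketched as follows. This is an illustrative sketch under assumptions, not the study's implementation: the function names, the number of folds and the error values are hypothetical, and the per-fold errors would in practice come from fitting PCR or PLSR models inside the inner CV.

```python
import random
from statistics import mean, mode, stdev


def group_folds(groups, n_folds, seed=0):
    """Assign whole groups (mangoes) to folds so no mango is split across sets."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    return [fold_of[g] for g in groups]


def one_standard_error(cv_errors):
    """Pick the number of components with the one-standard-error rule.

    cv_errors[a - 1] holds the per-fold validation errors of the model with
    a components. The result is the smallest a whose mean error is within one
    standard error of the lowest mean error.
    """
    means = [mean(e) for e in cv_errors]
    best = min(range(len(means)), key=means.__getitem__)
    se_best = stdev(cv_errors[best]) / len(cv_errors[best]) ** 0.5
    threshold = means[best] + se_best
    for a, m in enumerate(means, start=1):
        if m <= threshold:
            return a


def final_components(candidates):
    """The final number of components is the mode of the candidates."""
    return mode(candidates)


# 18 mangoes x 12 points = 216 samples, labelled by mango index.
groups = [m for m in range(18) for _ in range(12)]
folds = group_folds(groups, n_folds=3)

# Made-up per-fold errors for models with 1 to 4 components.
errors = [[1.0, 1.2], [0.50, 0.60], [0.48, 0.52], [0.40, 0.56]]
candidate = one_standard_error(errors)
```

Splitting by mango rather than by sample is the design choice that matters here: the 12 spectra of one fruit are strongly related, so letting them fall on both sides of a split would make the validation error look better than it is.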

Test
The external CV is rerun to build a model with the optimal number of components. Training uses the entire calibration set, and predictions are made for the test set. The procedure is repeated for the other CV segmentations until all data have been predicted, as shown in Figure 11. The result of the procedure explained in Figure 12 is the set of test predictions, on the scale produced by the pre-processing stage, that attempt to fit the scaled reference values. In addition, the scaled reference values and predictions went through an inversion of the previous scaling and transformation to obtain $y$ and $\hat{y}$. Finally, the model metrics are computed from these values. RMSE is defined in (5):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \qquad (5)$$
where $n$ is the number of predicted samples.
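The inversion of the scaling and transformation, followed by the RMSE in (5), can be illustrated with a short sketch. It assumes the power-of-two transformation of the dependent variable and a standardization fitted on the transformed values; the function names and the °Brix values are hypothetical, and clipping negative values before the square root is a choice made for this example.

```python
from math import sqrt
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and standard deviation used to standardize values."""
    return mean(values), pstdev(values)


def to_model_scale(brix, mu, sd):
    """Apply y -> y**2, then standardize with the fitted mean and deviation."""
    return [(b ** 2 - mu) / sd for b in brix]


def to_brix(scaled, mu, sd):
    """Undo the standardization, then undo the power-of-two transformation."""
    return [sqrt(max(s * sd + mu, 0.0)) for s in scaled]


def rmse(y_true, y_pred):
    """Root mean square error as in (5)."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


# Made-up reference values in °Brix.
brix = [10.0, 12.5, 15.0, 17.5]
mu, sd = fit_scaler([b ** 2 for b in brix])
scaled = to_model_scale(brix, mu, sd)

# Pretend the model predicted these values on the scaled axis.
scaled_pred = [s + 0.1 for s in scaled]
rmse_scaled = rmse(scaled, scaled_pred)
rmse_brix = rmse(brix, to_brix(scaled_pred, mu, sd))
```

Because the inverse transformation is non-linear, the error on the transformed scale and the error in °Brix are not related by a constant factor, which is why both are worth reporting.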

RESULTS AND DISCUSSION
The method presented consisted of evaluating 2 algorithms, 4 transformations, 3 types of feature extractors and 3 pre-processing techniques, in other words, 36 models, as shown in Table 1. Then, the best model for Brix is analyzed. Figure 13 shows the histogram of the optimal number of components found in the double CV for the PLSR-$y^2$-R-SNV model. Consequently, the final model used only 4 components.
Figure 14 shows the MSE on the transformed scale of the variable for different model complexities in the double CV. Each gray line shows the MSE in one of the CVs, while the blue line represents the average of all the gray lines. The red dashed line shows the optimal number of components determined in Figure 13. Finally, the cyan dashed line shows the MSE level in the test set. Figure 16 shows prediction errors versus true values. Figure 16(a) shows the prediction error in the transformed ($y^2$) and scaled system, while Figure 16(b) shows the prediction error in the true °Brix system versus the true value. In both graphs, a tendency toward negative errors can be observed for high °Brix values. The gray points show the prediction errors for each of the CVs of the double CV. Additionally, the prediction errors of the test set are represented with different colors corresponding to each mango.
Figure 17 shows the prediction errors versus sample number. Figure 17(a) shows the prediction error in the transformed ($y^2$) and scaled system, while Figure 17(b) shows the prediction error in the true °Brix system versus the sample number. In both plots, a mean close to zero and a relatively constant variance can be observed, so the error does not depend on the sample position. The prediction errors of the test set are represented with different colors corresponding to each mango. The results are analyzed based on the average RMSE and the range given by the standard deviation of the expected value according to (6):
$$\sigma = \frac{\hat{\sigma}}{\sqrt{n}} \qquad (6)$$
where $\sigma$ is the standard deviation of the expected value, $\hat{\sigma}$ is the standard deviation of the samples and $n$ is the number of samples. Table 3 shows the expected RMSE of PCR and PLSR, as well as the ranges for the Brix variable using different features. The PLSR results are better than those of PCR when using the first derivative. However, insufficient evidence has been found to affirm that PLSR is better than PCR for the other features. On the other hand, R performs on average better than the other features when using PCR. Additionally, R and the first derivative give on average better results than log(1/R) when using PLSR.
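The average and range defined by (6) can be computed as in the sketch below. The RMSE values are made up for illustration and the function names are hypothetical; reading two overlapping ranges as "insufficient evidence" of a difference is an assumption about how the comparison is made, not something stated explicitly in the text.

```python
from math import sqrt
from statistics import mean, stdev


def rmse_summary(rmse_values):
    """Return the average RMSE and the range average +/- sigma, with sigma from (6)."""
    avg = mean(rmse_values)
    sigma = stdev(rmse_values) / sqrt(len(rmse_values))
    return avg, (avg - sigma, avg + sigma)


def ranges_overlap(r1, r2):
    """True when two (low, high) ranges share at least one point."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]


# Made-up RMSE values (°Brix) for two groups of models.
group_a = [1.20, 1.25, 1.30, 1.35]
group_b = [1.50, 1.55, 1.60, 1.65]

avg_a, range_a = rmse_summary(group_a)
avg_b, range_b = rmse_summary(group_b)
clearly_different = not ranges_overlap(range_a, range_b)
```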
Table 4 shows the expected RMSE of PCR and PLSR, as well as the ranges for the Brix variable when applying different pre-processing techniques. The PLSR results are better than those of PCR when using MSC. However, insufficient evidence has been found to claim that PLSR is better than PCR with the other pre-processing techniques. Likewise, there is not enough evidence to state that any one technique gives better results than the others.

CONCLUSION
A total of 18 PCR and 18 PLSR models have been tested. A methodology based on a double CV has been implemented. The internal CV was used to find the optimal number of components with the one-standard-error rule. The model with the lowest RMSE used PLSR with 4 components, the $y^2$ transformation, reflectance R as the independent variable, and SNV as the pre-processing technique. This model obtained an RMSE of 1.1382 °Brix and an   of 0.514 on the transformed dimensional scale. The $y^2$ transformation has shown better metrics than the other transformations in both algorithms. Additionally, working directly with reflectance has given good results. Sufficient evidence has not been found to affirm that any pre-processing technique is better than another. Finally, PLSR always showed results equal to or better than those of PCR.


ISSN: 2088-8708. Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 532-546
Transformations for non-destructive evaluation of brix in mango by reflectance … (Ernesto Paiva-Peredo)
drops are extracted by inserting a cylindrical tip to a depth of 1 cm. The drops fall directly on the refractometer glass, and the measurement is then made by exposing the instrument to a natural light source. Subsequently, the glass is cleaned by carefully pouring distilled water on it and is finally dried with a cloth so that the device can be used for a new measurement. Figure 4 shows the Brix values of the 216 samples taken from the 18 mangoes.

Figure 5. Flow diagram of the pre-processing stage

Figure 9. Output of scaling stage for R signal

Figure 10. Double CV, with segments in the outer CV and segments in the inner CV, executed to obtain the optimal number of components. Adapted from [41]

Figure 11. Double CV, with segments in the external CV, executed to obtain test predictions of all data

Figure 13. Histogram of the optimal number of components in the double CV for the PLSR-$y^2$-R-SNV model

The model with the lowest RMSE applied PLSR, used the $y^2$ transformation, R as the independent variable and SNV as the pre-processing technique (Table 1). This model obtained an RMSE of 1.1382 °Brix and an   of 0.5140 on the transformed dimensional scale.

Table 1. RMSE of the 36 models

Table 2 shows the expected PCR and PLSR RMSEs and ranges for the Brix variable when applying different transformations. The PLSR results are better than those of PCR when applying the $\sqrt{y}$ transformation. However, insufficient evidence has been found to claim that PLSR is better than PCR for the other transformations. On the other hand, the $y^2$ transformation has shown on average better results than the other transformations for both techniques.

Table 2. Average and range of RMSE applying different transformations

Table 3. Average and range of RMSE applying different types of feature extraction

Table 4. Average and range of RMSE applying different pre-processing techniques