Assessing the performance of random forest regression for estimating canopy height in tropical dry forests

ABSTRACT


INTRODUCTION
The evaluation of the structure and dynamics of a forest ecosystem is crucial to understand its functioning and response to different disturbances and climate changes [1]. In this context, remote sensing technologies have become fundamental tools to obtain accurate and continuous information on the structure and dynamics of forests [2], [3]. Light detection and ranging (LIDAR) and synthetic aperture radar (SAR) have proven to be useful for obtaining precise measurements of the height and structure of forests in different types of forest ecosystems [4], [5].
This article presents an evaluation of the relationship between height metrics obtained through the global ecosystem dynamics investigation (GEDI) LIDAR sensor and SAR images from the advanced land observing satellite-2 phased array type L-band synthetic aperture radar (ALOS PALSAR-2) sensor in the tropical dry forest of Caldas Department, Colombia. To analyze this relationship, the random forest algorithm is used, known for its ability to analyze large datasets and generate accurate predictive models [6]. The capacity of this algorithm to estimate canopy height from these different data sources is evaluated [7].
The estimation of canopy height is fundamental for sustainable forest management and understanding forest dynamics [8]. However, direct measurements of canopy height can be costly, labor-intensive, and even impossible in remote or extensive areas [9]. For this reason, satellite-based approaches have been developed to estimate canopy height at global and regional scales. The use of satellite data can be considered an indirect method for estimating forest canopy heights, which has become increasingly accurate with the evolution of satellite technologies [10]. In this context, the GEDI mission of the National Aeronautics and Space Administration (NASA) [11] has stood out for providing valuable information on forest height through medium-resolution time series [12]. Although field validation is important to correlate two types of satellite data, the lack of it does not invalidate the usefulness of GEDI as an input for canopy height estimation [13]. GEDI datasets are improving height models and offering an important tool for global and regional forest management [4].
In the search to improve the accuracy of canopy height models, a combination of LIDAR and SAR data [14] has been used, which offers detailed information on the structure and dynamics of tropical forests. These data allow identifying patterns in the vertical distribution of forests and improving the accuracy of canopy height models [15]. Despite the advances achieved with this methodology, there are still challenges in terms of model accuracy. In particular, the lack of structural information on the forest has been a significant obstacle to improving the accuracy of such models [16]. The combination of LIDAR and SAR data offers a unique synergy for the evaluation of canopy height and texture in any type of forest [12], providing detailed information on forest structure and its vertical distribution. SAR data from the ALOS PALSAR-2 sensor [17], through the generation of various polarimetric indices, are suitable for texture detection, improving the accuracy of canopy height models [13].
The estimation of canopy height in tropical forests is valuable for the management and conservation of natural resources, including the identification and monitoring of biodiversity, assessment of forest productivity, and carbon monitoring. GEDI and SAR data provide information at global and regional scales, but machine learning algorithms are needed to fully evaluate the information. The evaluation of the relationship between GEDI height metrics and SAR polarimetric indices using machine learning techniques is crucial for improving the understanding of tropical forests and promoting their sustainable conservation and management.
The article is organized as follows: section 2 presents the materials and methods used for the development of the research. In section 3, the results of the study are shown and discussed. Section 4 presents the conclusions.

Study area
The study area corresponds to the tropical dry forest ecosystem in the eastern part of the Caldas Department in Colombia. This is an area of great biological and environmental importance, which makes it suitable for investigating the dynamics, functioning, and response of this type of ecosystem to environmental changes. It is characterized by a prolonged dry season, high temperatures, and limited availability of water [18]. It lies in the valley of the Magdalena River, one of the most important rivers in the country, at an elevation ranging from 150 to 400 meters above sea level. The spatial delimitation of the study area was produced by [18], as part of the project called "Conservation status of the floristic component in the tropical dry forest of Caldas," as shown in Figure 1. The tropical dry forest is characterized as an ecosystem with a closed tree canopy and vegetation that varies between semi-deciduous and fully deciduous foliage. Despite its ecological importance and the high degree of human intervention it has undergone, the tropical dry forest has been one of the least studied ecosystems. In this sense, it is crucial to investigate and better understand the dynamics and functioning of this ecosystem in the selected study area in order to generate scientific knowledge that contributes to its conservation and sustainable management. The dry forest area is highly fragmented due to human activity, which has led to the creation of multiple polygons with isolated extensions and distributions [19]. To facilitate image processing and data extraction, a quadrant was delimited that covers the multiple polygons. However, it is emphasized that the analyses and results obtained correspond exclusively to the dry forest area.

Data acquisition
In the present research, two satellite image collections were used: GEDI and ALOS PALSAR-2. Image processing and data extraction were performed through Google Earth Engine (GEE), which allowed for greater efficiency and speed in the image analysis [20]. GEDI is a data collection containing measurements of surface height acquired by the instrument aboard the International Space Station [21]. This instrument uses a laser system to measure the distance between the sensor and the earth's surface, and from these measurements, vegetation height can be determined [22]. The selected image consists of a collection of images obtained through GEE with a spatial resolution of 25 meters and a temporal range from 2019 to 2020. The rh95 metric was used because, according to previous studies, it has shown better results compared to other metrics [13].
The ALOS PALSAR-2 image is acquired by a satellite from the Japan Aerospace Exploration Agency (JAXA) [23], which uses L-band SAR technology to capture images of the earth's surface [24]. The image is a collection obtained through GEE, with a spatial resolution of 25 meters and standard pre-processing provided by JAXA, which includes geometric and terrain slope corrections [25]. The image collection is available annually and has been constructed using a mean filter that combines images acquired from 2015 to 2020.

Data processing
The methodological scheme in Figure 2 shows the development of data processing. To process the ALOS PALSAR-2 image, several activities were carried out, including structuring the image collection in GEE. First, the collection was filtered by dates and region of interest to select the most suitable images for analysis and their corresponding composition during the aforementioned analysis period. Subsequently, the Pauli polarimetric decomposition technique was used to obtain a new polarization (VV) from the original polarizations: horizontal transmit, horizontal receive (HH) and horizontal transmit, vertical receive (HV). This technique allows for the decomposition of the polarimetric scattering matrix into three backscattering matrices, which in turn facilitates the analysis of the polarimetric data of the image and highlights important features of the land surface, such as forest structure [26]. Various polarimetric indices were generated from the HH, HV, and VV polarizations, providing complementary information about the forest composition [10]. Polarimetric indices provide a quantitative measure of the polarimetric properties of the land surface and are used to characterize the physical properties of objects in the image [27]. Additionally, they can be used to analyze specific features such as the asymmetry of surface backscattering, object shape, and orientation [28]. These features can be directly associated with the structure and composition of the tropical dry forest. Table 1 presents the polarizations from the ALOS PALSAR platform and Table 2 the polarimetric indices employed in the research.
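As an illustration of how a polarimetric index is derived from the polarization bands, the sketch below computes two indices per pixel. Table 2 is not reproduced in this text, so the formulas are assumptions: they are the definitions commonly used in the literature for the radar forest degradation index, (HH - HV) / (HH + HV), and the dual-polarization radar vegetation index, 4 HV / (HH + HV), and may differ from those applied in the study. The input values are hypothetical backscatter values in linear power units.

```python
import numpy as np

def rfdi(hh, hv):
    """Radar forest degradation index: (HH - HV) / (HH + HV)."""
    return (hh - hv) / (hh + hv)

def rvi_dual_pol(hh, hv):
    """Dual-polarization radar vegetation index: 4 * HV / (HH + HV)."""
    return 4.0 * hv / (hh + hv)

# Hypothetical backscatter values in linear power units (not from the study).
hh = np.array([0.20, 0.10, 0.05])
hv = np.array([0.05, 0.04, 0.01])

print("RFDI:", rfdi(hh, hv))
print("RVI :", rvi_dual_pol(hh, hv))
```

Both functions operate element-wise, so the same code applies to full raster bands loaded as NumPy arrays.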
Figure 3 shows a view of the quadrant that includes the study area. On the right, a Sentinel-2 image is shown in red, green, and blue (RGB) combination, while on the left, a section of the study area is presented in different frames. In this section, the generated polarimetric indices are displayed on a color scale, and the tropical dry forest area is delimited with a black contour. The polarizations and the generated polarimetric indices were compiled into an image stack that allows all the obtained information to be visualized together. Additionally, to improve the image quality and reduce the noise produced by the speckle phenomenon, a filtering technique based on the improved Lee sigma filter was applied. This technique yielded sharper and clearer images, facilitating the identification and analysis of objects and features present in the image [29]. Finally, two image conversions were performed to improve interpretation (Figure 4): from linear units to decibel units to expand the dynamic range and highlight areas of higher or lower signal intensity, and from decibels to power backscatter to show the amount of energy backscattered by objects in the scene, following the methodology of [10].
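The two unit conversions described above follow the standard decibel relations, dB = 10 log10(P) and P = 10^(dB/10). The following minimal sketch applies them to a small array; the function names and sample values are illustrative and are not taken from the study's GEE scripts.

```python
import numpy as np

def linear_to_db(power):
    """Convert backscatter from linear power units to decibels."""
    return 10.0 * np.log10(power)

def db_to_linear(db):
    """Convert backscatter from decibels back to linear power units."""
    return 10.0 ** (db / 10.0)

# Hypothetical backscatter values in linear power units.
gamma0 = np.array([1.0, 0.1, 0.01])

gamma0_db = linear_to_db(gamma0)    # decibel scale
restored = db_to_linear(gamma0_db)  # back to linear power

print(gamma0_db)
print(restored)
```

The decibel scale spreads low backscatter values over a wider numeric range, which is why it is convenient for visual interpretation, whereas ratios and sums of bands are usually computed in linear power.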
A two-step process was carried out to process the GEDI image collection and ensure accurate data integration. First, the collection was structured using GEE and the rh95 metric was selected as the variable of interest. Subsequently, geospatial processing procedures were carried out for vectorization of the selected metrics, followed by their export as a shapefile. To ensure spatial consistency and accurate alignment, co-registration of GEDI and ALOS PALSAR imagery was performed prior to integration into a multiband raster file. This process ensured that each pixel of both images corresponded to exactly the same geographic location within the dry forest study area [30]. The final stack included the height metrics (rh95) and polarimetric indices mentioned previously. Once the shapefile was imported as an asset to GEE, the corresponding data from the image stack was extracted. This allowed for the creation of a CSV database with the GEDI and ALOS PALSAR-2 data, which was used to structure the regression algorithm in Python in Google Colaboratory, allowing for detailed and precise analysis of the information.

Structuring of the random forest algorithm
After importing the shapefile as an asset in GEE, an exploratory analysis of the dataset was conducted, which included obtaining descriptive statistics as well as creating and analyzing histograms and boxplots for each variable. The goal was to evaluate the presence of missing and outlier values. Values that could affect the quality of the results were removed. After the exploratory analysis, further activities were carried out to analyze the obtained data. Firstly, the Shapiro-Wilk test was used to evaluate the normality of the data [31]. Secondly, Spearman correlation was calculated to evaluate the monotonic, not necessarily linear, relationship between pairs of variables, and scatter plots were generated to visualize the relationship between canopy height and SAR bands [32]. These activities allowed identifying the key relationships between the variables, which contributed to the selection of the most relevant variables for the structure of the random forest algorithm.
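The exploratory steps can be sketched as follows. The text does not state the rule used to remove outliers, so the 1.5 x IQR boxplot criterion below is an assumption, as are the helper names `drop_iqr_outliers` and `shapiro_table` and the synthetic data; only `scipy.stats.shapiro` corresponds to the test named above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def drop_iqr_outliers(df, columns, k=1.5):
    """Keep rows that lie within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]

def shapiro_table(df, columns, alpha=0.05):
    """Run the Shapiro-Wilk test per column and flag p-values below alpha."""
    rows = []
    for col in columns:
        stat, p_value = stats.shapiro(df[col])
        rows.append({"variable": col, "W": stat, "p_value": p_value,
                     "normal": bool(p_value >= alpha)})
    return pd.DataFrame(rows)

# Synthetic stand-in for the CSV database (values are not from the study).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "HV": rng.normal(-15.0, 2.0, 300),
    "rh95": rng.lognormal(2.5, 0.5, 300),
})
df.loc[0, "HV"] = 40.0  # inject one obvious outlier

clean = drop_iqr_outliers(df, ["HV", "rh95"])
print(len(df), "records before,", len(clean), "after outlier removal")
print(shapiro_table(clean, ["HV", "rh95"]))
```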
Once the variables were selected, the random forest model was implemented to predict canopy height. Using the Scikit-learn library in Python, the independent variables ('HH', 'HV', 'VV', 'RVI', 'RFDI', 'NDBI', 'AI', and 'DI') were assigned to the X variable, while the dependent variable 'GEDI rh95' was assigned to the Y variable. The model implementation was based on the RandomForestRegressor class of Scikit-learn. To improve the results, a function was created that iterated over different values of hyperparameters such as the number of trees, maximum depth, and random seed, in order to select the model that best fit the data and best captured the relationships between the independent variables and the dependent variable. In addition, 5-fold cross-validation was applied to assess the accuracy of the model. This technique allowed comparing the predictions with the values of the rh95 metric and evaluating the predictive capacity of the model on different data subsets, reducing the risk of overfitting and increasing the reliability of the results. Subsequently, the model was fitted using all training data and predictions were made. Evaluation metrics, namely the coefficient of determination (R2), mean squared error (MSE), and root mean squared error (RMSE), were calculated to assess model performance. In addition, graphs comparing actual values with predicted values were generated, allowing visualization of the quality of the predictions and the relationship between observed and estimated values.
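A minimal sketch of such a hyperparameter iteration with 5-fold cross-validation is given below. The function name `search_random_forest`, the candidate grids, the use of R2 as the selection score, and the synthetic data are assumptions made for illustration; the study's actual grid and selection criterion are not reported in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score

def search_random_forest(X, y, n_trees=(50, 100), depths=(5, None), seeds=(0, 1)):
    """Return the best mean 5-fold R2 and the hyperparameters that produced it."""
    best_score, best_params = None, None
    for n in n_trees:
        for depth in depths:
            for seed in seeds:
                params = {"n_estimators": n, "max_depth": depth, "random_state": seed}
                model = RandomForestRegressor(**params)
                score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
                if best_score is None or score > best_score:
                    best_score, best_params = score, params
    return best_score, best_params

# Synthetic predictors and target (not the study's data).
rng = np.random.default_rng(42)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 20.0 * X[:, 0] + 5.0 * X[:, 1] + rng.normal(0.0, 1.0, 200)

cv_r2, params = search_random_forest(X, y)

# Refit on all data with the selected settings, then predict.
final_model = RandomForestRegressor(**params).fit(X, y)
pred = final_model.predict(X)
mse = mean_squared_error(y, pred)
print(params, "cross-validated R2:", round(cv_r2, 3))
print("R2:", r2_score(y, pred), "MSE:", mse, "RMSE:", np.sqrt(mse))
```

Metrics computed on the same data used for fitting are optimistic, so the cross-validated score is the more reliable indicator of predictive capacity.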

RESULTS AND DISCUSSION
After carrying out the descriptive data analysis, which included obtaining descriptive statistics, generating and analyzing histograms and boxplots, and removing outliers for each variable, the Shapiro-Wilk test was performed to assess the normality of the data. The initial dataset contained 1,028 records. After removing outliers, 957 records remained (Figure 5). A significance level of 0.05 was used to evaluate the normality of the data using the Shapiro-Wilk, Anderson-Darling, Cramér-von Mises, and Kolmogorov-Smirnov tests. Each test generates a test statistic and a p-value; if the p-value is less than 0.05, the null hypothesis that the data come from a normal distribution can be rejected. After applying the tests, the variables HH, NDBI, AI, DI, and GEDI rh95 were found not to be normally distributed. Because more than 50% of the variables did not meet the normality assumption, it was decided to use non-parametric statistics. Therefore, the Spearman correlation test [32], which is suitable for non-normally distributed variables, was performed to evaluate the relationship between the variables of interest. The resulting correlation matrix allowed the identification of positive and negative correlations between variables. This is important for selecting the variables to be included in the model for estimating heights with GEDI and ALOS PALSAR-2, avoiding the inclusion of variables that are highly correlated with each other, which can generate multicollinearity and affect the accuracy of the model. The Spearman correlation matrix showed that the dependent variable "GEDI rh95" correlates positively with RVI and negatively with RFDI and NDBI. The p-values of these correlations were significant, suggesting that these variables are relevant in estimating forest height. These correlations are also useful in identifying variables that have high correlation with each other and reducing the complexity of the model.
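A Spearman correlation matrix and a simple screen for highly inter-correlated predictors can be sketched as below. The helper `spearman_report`, the 0.9 threshold, and the generic column names are assumptions for illustration; the data are synthetic and do not reproduce the correlations reported in the study.

```python
import numpy as np
import pandas as pd
from scipy import stats

def spearman_report(df, target, threshold=0.9):
    """Return target correlations and predictor pairs with |rho| >= threshold."""
    corr = df.corr(method="spearman")
    with_target = corr[target].drop(target).sort_values()
    predictors = [c for c in df.columns if c != target]
    redundant = [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(predictors)
        for b in predictors[i + 1:]
        if abs(corr.loc[a, b]) >= threshold
    ]
    return with_target, redundant

# Synthetic data: x2 nearly duplicates x1, x3 is unrelated noise.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.05, size=200),
    "x3": rng.normal(size=200),
})
df["height"] = np.exp(0.5 * df["x1"]) + rng.normal(scale=0.1, size=200)

with_target, redundant = spearman_report(df, "height")
rho, p_value = stats.spearmanr(df["x1"], df["height"])

print(with_target)
print("Highly correlated predictor pairs:", redundant)
print("rho(x1, height) =", rho, "p-value =", p_value)
```

Pairs returned in `redundant` are candidates for dropping one member before modeling, which is one way to act on the multicollinearity concern raised above.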

Random forest model application
The random forest machine learning algorithm was used, which combines the prediction of multiple decision trees to improve accuracy and reduce overfitting [6]. The scikit-learn library of Python was used to build the model, and the k-fold cross-validation technique was applied to evaluate its predictive ability. For each independent variable (HH, HV, VV, RVI, RFDI, NDBI, AI, and DI), a for loop was executed to build a model and obtain specific metrics and scatter plots. A random forest model with 100 trees was constructed, and 5-fold cross-validation was used for each independent variable. This process involves dividing the data into k subsets and training the model on k-1 subsets, using the remaining subset as a validation set. This was repeated k times to obtain a more accurate evaluation of the model's generalization ability.
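The per-variable loop can be sketched as follows: one 100-tree random forest per predictor, scored with out-of-fold predictions from 5-fold cross-validation. The function name `per_variable_cv_r2`, the shuffled folds, the fixed seed, and the synthetic columns are assumptions; the study's own loop also produced scatter plots, which are omitted here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_predict

def per_variable_cv_r2(df, predictors, target, n_splits=5, seed=0):
    """Fit one single-predictor random forest per variable; return out-of-fold R2."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {}
    for name in predictors:
        model = RandomForestRegressor(n_estimators=100, random_state=seed)
        # Each sample is predicted by a model trained on the other k-1 folds.
        pred = cross_val_predict(model, df[[name]], df[target], cv=folds)
        scores[name] = r2_score(df[target], pred)
    return scores

# Synthetic data: one informative predictor and one pure-noise predictor.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    "informative": rng.uniform(0.0, 1.0, 150),
    "noise": rng.normal(size=150),
})
df["height"] = 10.0 + 15.0 * df["informative"] + rng.normal(0.0, 1.0, 150)

scores = per_variable_cv_r2(df, ["informative", "noise"], "height")
for name, r2 in scores.items():
    print(f"{name}: out-of-fold R2 = {r2:.3f}")
```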
Scatter plots were generated for each of the eight independent variables obtained from the SAR bands and vegetation height obtained by GEDI (rh95) to analyze their relationship. Each plot shows the relationship between one independent variable and vegetation height in the study area, giving a total of eight plots. Subsequently, the model was fit with all training data, and predictions were made with all data. To evaluate the predictive ability of the model, evaluation metrics (R2, MSE, and RMSE) were calculated for each independent variable. The results were plotted, showing the actual vs. predicted values for each independent variable in a scatter plot (Figure 5). Overall, a positive relationship was observed between the actual and predicted values for all independent variables. Figure 6 shows the eight scatter plots, where the corresponding independent variable for each SAR band is on the x-axis, and the vegetation height (rh95) obtained by GEDI is on the y-axis. Each plot shows the data points and the fitted regression line, which represents the relationship between the independent variable and canopy height.
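For reference, the three evaluation metrics can be computed directly from their definitions: MSE is the mean of the squared residuals, RMSE is its square root, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical heights and also reports the mean residual as a simple check for systematic over- or underestimation; the function name is illustrative.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute R2, MSE, RMSE, and mean residual (observed minus predicted)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "bias": residuals.mean(),
    }

# Hypothetical observed and predicted canopy heights.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]

metrics = regression_metrics(observed, predicted)
print(metrics)
```

Because RMSE is expressed in the same units as the target, it is usually the easiest of the three to interpret for height estimates.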
Based on the results obtained, it can be observed that the variables with the highest relationship to GEDI rh95 are NDBI and RVI, with R2 values of 0.76 and 0.72, respectively. These polarimetric indices are related to the response of the tropical dry forest because NDBI is based on the difference in backscattering between horizontal and vertical waves, allowing the detection of degraded areas [33]. In the case of RVI, this index is related to the density and vertical structure of vegetation, making it a good measure for evaluating forest biomass [34].
On the other hand, the variables with a lower relationship to GEDI rh95 are VV and HV, with R2 values of 0.53 and 0.57, respectively. These polarizations are related to the response of the dry forest because SAR polarimetry is sensitive to the structure and geometry of vegetation, allowing the detection of changes in forest structure [27]. However, in this case, it can be observed that the relationship with GEDI rh95 is lower, which may be due to the low density of vegetation in the tropical dry forest ecosystem. Regarding the other variables, it can be observed that they have a moderate relationship with GEDI rh95, with R2 values ranging between 0.61 and 0.64. These polarimetric indices are related to the response of the tropical dry forest because SAR polarimetry allows for the characterization of vegetation structure and geometry, which can detect changes in forest structure. However, it is observed that the relationship with GEDI rh95 is lower than that of NDBI and RVI, suggesting that these indices may not be as sensitive for evaluating forest biomass in the tropical dry forest ecosystem. Taking this into account, NDBI is the variable with the best relationship for predicting canopy height measured from rh95.
Performance plots of the model are presented in Figure 7. In the real value distribution plot, it can be observed that most values are between 2.5 and 30, with a higher concentration around 17.5. On the other hand, the predicted value distribution plot shows a similar distribution, which suggests that the model can predict rh95 values with some precision. In the error distribution plot, it can be observed that most errors are around 0, indicating that the model does not have a systematic tendency to overestimate or underestimate rh95 values.
In summary, it can be concluded that the NDBI variable is the one that best explains the relationship with rh95, as evidenced by the high R2 value obtained. The evaluation of the performance graphs supports this conclusion by showing a similar distribution between the actual and predicted values, with errors concentrated around 0. The RF model with this input variable shows a good fit and predictive capacity and can be used to estimate forest height with reasonable accuracy.

CONCLUSION
After carefully examining the study's findings, it can be said that using polarimetric indices to estimate vegetation height is a useful and promising method. Estimates of the height of the forest canopy can be more accurately and thoroughly determined thanks to the extra information about the structure and composition of vegetation that is provided by polarimetric analysis of SAR data. In particular, the regression between polarimetric indices and the dependent variable GEDI rh95 has found use of the RF algorithm to be a valuable

Figure 1. Location of the study area, tropical dry forest, Caldas, Colombia

Figure 2. General workflow used in the research

Figure 4. On the left is a visualization of the ALOS PALSAR-2 image in polarization combination. In the center, all metrics within the quadrant are presented. On the right, the metrics within the specific dry forest area are shown

Figure 5. Box plots for each variable before and after outlier cleanup

Figure 6. Scatter plots for each independent variable and its relationship to rh95

Figure 7. Performance graphs of the NDBI variable with the rh95 metric