Remote sensing in the analysis between forest cover and COVID-19 cases in Colombia

ABSTRACT


INTRODUCTION
In 2019, after several years, the world again experienced a pandemic: coronavirus disease 2019 (COVID-19), caused by the SARS-CoV-2 virus [1]. To mitigate the spread of the virus and reduce the risk of the population contracting COVID-19, a large portion of the world's governments implemented mitigation strategies [2], [3]. Meanwhile, the research and academic sectors began to investigate the relationships between the spread of the virus and different factors, such as environmental [4]-[6], demographic and economic [7]-[9], age-related, and climatological factors, as well as vaccination with Bacillus Calmette-Guérin (BCG) and treatment for malaria [10]. Likewise, different authors have worked on mathematical models for modeling and estimating the long-term spread of COVID-19 [11].
Spatial data mining techniques allow the extraction of information from geographic data to perform correlation analysis [12] between different variables, such as land use and forest fire risk [13], the spatial distribution of forest cover and environmental variables [14], and the quantification of vegetation types using multitemporal analysis [15]. Given the above, and considering that remote sensing makes it possible to carry out vegetation cover analysis [16]-[18], vegetation cover is another factor studied in relation to the spread of COVID-19.

Thus, another factor that has been studied with regard to COVID-19 is the green areas present in cities, in order to determine whether the existence of these spaces mitigates the spread of the virus. Studies focused on this factor have established a negative correlation between the existence of green spaces and the infection rate; that is, the number of infections is lower in areas with more green spaces [19]-[21]. Numerical data indicate that as the percentage of vegetation coverage increases, the spread rate of the virus decreases: if the coverage increases by 1%, the number of accumulated COVID-19 cases decreases by 2.6%, according to [22]. This is consistent with studies on the relationship between some respiratory diseases and green areas, such as the case presented in [23], which suggests that growing up in urban areas with large green areas reduces the risk of developing asthma thanks to the macrobiota present. Considering the importance of green areas in reducing the chances of developing some diseases, some authors propose incorporating biophilic elements into urban design, such as green roofs, which could help improve air quality [24].
In this article, remote sensing data on forest cover are used to study its relationship with COVID-19, applying several phases of the analytics solutions unified method for data mining (ASUM-DM), a methodology for data analysis projects. Remote sensing data have made it possible to model the spread of diseases in terms of their spatial and temporal dynamics [7], [25]. Specifically, this article analyzes the association between COVID-19 cases, including recovered and deceased cases, and forest cover. The research hypothesis that "greater forest cover is associated with a lower number of COVID-19 cases" was verified in a case study in Colombia.

METHOD
Study area
The study area is Colombia, a country located in the northwest of South America and comprising 32 departments according to its internal political division. The country has a Gini index of 0.53 [26], a value that lies between perfect equality (0) and perfect inequality (1). From this value, it can be inferred that a significant portion of the population lacks sufficient economic income. Colombia exhibits a diverse range of land cover types across its five regions: the Caribbean region, characterized by grasslands, wetlands, and bare soils; the Pacific region, known for its high rainfall and lush rainforests; the Orinoco region, dominated by vast savannahs on a flat landscape; the Amazon rainforest area, featuring dense tropical jungles; and the Andes mountain range, comprising heavily modified natural ecosystems, extensive agricultural cultivation, road networks, and urban areas [27].

Data sources
The data used in this research and their respective sources are presented in Table 1. The population data for the study area were obtained from the 2018 population census report by the National Administrative Department of Statistics [28], which is the most recent census in Colombia. The number of COVID-19 cases was obtained from the National Institute of Health of Colombia (Instituto Nacional de Salud, INS) [29]. The land cover data for Colombia were obtained from the National Institute of Hydrology, Meteorology and Environmental Studies (IDEAM) [30] and geographically processed using QGIS software [31], [32]. IDEAM employs the CORINE Land Cover methodology [33], [34] to generate the land cover classification. Finally, the forest cover data were obtained from the global forest/non-forest map (FNF) dataset [35] implemented in Google Earth Engine (GEE).
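As an illustration of how the case counts and census population could be combined per department, the following minimal Python sketch aggregates case records and derives a density. It rests on assumptions that are not stated in the data description: the definition of case density as cases per 100,000 inhabitants, the function name `case_density`, and all sample values, which are placeholders rather than INS or census figures.

```python
from collections import Counter

# Hypothetical case records: one entry per confirmed case, labeled by department.
case_records = ["Antioquia", "Amazonas", "Antioquia", "Antioquia", "Amazonas"]

# Hypothetical population counts (placeholders, not census figures).
population = {"Antioquia": 300_000, "Amazonas": 50_000}


def case_density(records, population, per=100_000):
    """Return cases per `per` inhabitants for each department.

    The per-100,000 definition is an assumption made for this sketch.
    """
    counts = Counter(records)
    return {
        dept: counts.get(dept, 0) / pop * per
        for dept, pop in population.items()
    }


density = case_density(case_records, population)
for dept, value in sorted(density.items()):
    print(f"{dept}: {value:.2f} cases per 100,000 inhabitants")
```

Normalizing by population in this way keeps departments of very different sizes comparable; normalizing by area would be an equally plausible reading of "density".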

Methodology
First, the study area is defined (in this case, Colombia), and a time window is established from October 1, 2020 to December 31, 2021, the period covered by the primary data of the study. Subsequently, data on the behavior of the COVID-19 pandemic in the study area, in which the number of cases is identified, are obtained from the information system of the National Institute of Health of Colombia. Forest data are also obtained for the study area and the established time window. This study focused on analyzing the five main categories of land cover: i) artificialized territory (CLC1), ii) agricultural territories (CLC2), iii) forests and semi-natural areas (CLC3), iv) wet areas (CLC4), and v) water surfaces (CLC5). For the execution of the study, some stages of the analytics solutions unified method for data mining/predictive analytics (ASUM-DM) methodology, created by IBM in 2015 [37], were applied in an adapted form for better understanding, as shown in Figure 1.
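Because CORINE Land Cover codes are hierarchical, with the first digit of a code identifying its main category, the five categories above can be recovered from more detailed codes. The sketch below shows one way to total land cover area per department and main category. The function names, the row layout `(department, code, area_km2)`, and the sample values are hypothetical and serve only as illustration.

```python
from collections import defaultdict

# Main (level-1) categories as listed in the text.
LEVEL1_NAMES = {
    "1": "artificialized territory",
    "2": "agricultural territories",
    "3": "forests and semi-natural areas",
    "4": "wet areas",
    "5": "water surfaces",
}


def level1_category(code):
    """Map a hierarchical land cover code (e.g. 31 or 313) to 'CLC1'..'CLC5'."""
    first_digit = str(code)[0]
    if first_digit not in LEVEL1_NAMES:
        raise ValueError(f"unknown land cover code: {code!r}")
    return "CLC" + first_digit


def area_by_category(rows):
    """Sum area per (department, main category).

    `rows` is an iterable of (department, code, area_km2) tuples;
    this layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    for department, code, area_km2 in rows:
        totals[(department, level1_category(code))] += area_km2
    return dict(totals)


# Hypothetical polygons (placeholder areas, not IDEAM data).
sample_rows = [
    ("Dept A", 31, 10.0),
    ("Dept A", 32, 5.0),
    ("Dept A", 11, 2.0),
    ("Dept B", 51, 1.5),
]
totals = area_by_category(sample_rows)
for key, area in sorted(totals.items()):
    print(key, area)
```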

Figure 1. Methodology flowchart
The analytical approach phase was carried out to identify the variables required in the study. Subsequently, the data requirement phase established these variables with their respective characteristics. Once these were defined, the data collection phase was carried out, during which the different sources containing the variables were accessed. After obtaining the data, an initial exploratory analysis was carried out in the data comprehension phase to observe the variables' behavior. Once this behavior was known, the data preparation phase was executed, in which erroneous and atypical data were cleaned and standardized. Finally, the data were analyzed in the model construction and analysis phase.

RESULTS AND DISCUSSION
The analytical approach to the problem suggests the need to understand the relationship between the variable "case density" and the variables "forest density" and "land cover" in order to identify the impact of forest coverage on the number of COVID-19 infections. Accordingly, the COVID-19 dataset was filtered and consolidated by department using the R software. The land cover data were processed in QGIS using the "Join attributes by location" data management tool with the geometric predicate "contains." This procedure associates each geographical unit with the land cover categories within its boundaries. Subsequently, the land cover data were merged into the preexisting COVID database to create an integrated dataset. Following the data cleaning process performed with R and Excel, the data comprehension phase commenced, including an exploratory analysis that produced the findings in Table 2.
As observed in Table 2, the Shapiro-Wilk test indicates that the analyzed variables are generally not normally distributed, with p-values below 0.05. Forest density shows the closest approximation to a normal distribution, with a Shapiro-Wilk p-value of 0.102. Given the purpose of this study and considering the rejection of the normality assumption of the data, Pearson's correlation coefficient was applied, obtaining the values indicated in Table 3. As can be seen, there is a Pearson coefficient r of -0.439 (p-value<0.012) between the variables forest density and case density. Between the variables case density and CLC3, a Pearson coefficient r of -0.383 (p-value<0.031) was obtained. The results in Table 3 indicate an inverse trend between case density and the variables forest density and CLC3, with departments with high forest cover presenting a low number of COVID-19 cases, thus supporting the hypothesis raised in this article. This is consistent with what was proposed in [19], [20], where green areas are associated with low severity of COVID-19 infections and a low transmission rate, thanks to optimum air quality.
Considering the correlation obtained for the CLC3 variable, an analysis was conducted of the vegetation covers within category 3. These covers include, first, those of a forest type (code: 31); second, those of a shrubby and herbaceous type (code: 32); and third, territories composed of bare soils, rocky outcrops, and sandy areas (code: 33). The analysis reveals a significant correlation (r=-0.368, p-value=0.038) between the density of COVID-19 cases and coverage 31, which corresponds to forested areas, as shown in Table 4. This correlation value aligns with the findings of the previous analysis conducted using forest density data obtained from the FNF map, further confirming the inverse relationship between the number of COVID-19 cases and the extent of forest presence in the study areas.
The correlation between the case density of COVID-19 and the CLC3 coverage is visually presented in Figure 2 and Figure 3. The graphical representation reveals that certain departments with a high case density (depicted in an intense red color in Figure 3) exhibit low CLC3 coverage (depicted in a very light green color in Figure 2). This can also be verified in Figure 4, where a decrease in cases is observed as the area of CLC3 coverage by department increases. In more detail, Figures 5, 6, and 7 graphically present the relationship between case density and each of the three coverages analyzed at the CLC3 level. A negative relationship can be observed between each of the coverages and the density of cases, as shown in Table 4. This supports the idea that places with a reduced surface extension of the coverages under analysis had a comparatively high number of COVID-19 cases.
Finally, unlike other studies, the present study reports the correlation between five categories of land cover and the density of COVID-19 cases. The cover with the highest negative correlation is CLC3, associated with forest. Although this has been reported in other articles, the present article shows that the correlation with COVID-19 cases differs depending on the type of cover within this category. Thus, areas with forest cover (code 31) present the highest negative correlation; this cover includes natural forests and plantations, and at the local level in Colombia, it also includes natural biological forms such as palm and guadua. A lower negative correlation is associated with herbaceous and/or shrub vegetation cover (code 32), which includes low vegetation and tangle vegetation for Colombia. Finally, the lowest negative correlation is found with the cover of open areas with little or no vegetation (code 33), particularly bare and burned soils or soils covered by ice and snow.

ISSN: 2088-8708. Int J Elec & Comp Eng, Vol. 14, No. 1, February 2024: 732-740
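The link between a Pearson coefficient and its p-value in the results above can be illustrated with a short, standard-library Python sketch. It is not the authors' R workflow. It assumes one observation per department (n = 32, taken from the description of the study area) and uses the usual two-sided t-test for a correlation with n - 2 degrees of freedom; the tail probability is approximated numerically with Simpson's rule. All function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_pdf(x, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p_value(r, n, steps=2000):
    """Two-sided p-value for H0: no correlation, given r and sample size n.

    Uses t = |r| * sqrt((n - 2) / (1 - r^2)) and integrates the t density
    on [0, t] with Simpson's rule (`steps` must be even).
    """
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 <= T <= t)
    return 1 - 2 * central_area


# Coefficients reported in the text; n = 32 is an assumption (one row per department).
for label, r in [("forest density", -0.439), ("CLC3", -0.383), ("code 31", -0.368)]:
    print(f"case density vs {label}: r = {r}, p = {two_sided_p_value(r, 32):.3f}")
```

Under the n = 32 assumption, the recomputed p-values should fall close to those reported in Tables 3 and 4, which offers a quick consistency check on the reported figures.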

Figure 4. Relationship between CLC3 area and case density

Table 1. Data sources

Table 3. Pearson's correlations of the variables case density, forest density, and the five cover categories

Table 4. Pearson's correlations of the case density variable and the three categories of the CLC3 level of coverage