Self-organizing map (SOM) for species distribution modelling of bird species at Kenyir landscape

Identifying which species are more dominant than others in a given area is a challenging task, because many species may be abundant enough to dominate a particular region. This produces large datasets with complex variables to analyse. Moreover, the responses of organisms to environmental factors are non-linear. Nevertheless, the effort is important for conserving biodiversity. To understand the complex relationships between species distribution and habitat, we analysed the interactions among bird diversity, spatial distribution and land use types at Kenyir landscape in Terengganu, Malaysia, using the self-organizing map (SOM), an artificial neural network (ANN) method. SOM performs an unsupervised, non-linear analysis of large and complex datasets. It can handle the non-linear correlation between organisms and environmental factors because it identifies clusters and relationships between variables without fixed assumptions of linearity or normality. The results suggest that SOM analysis is well suited to understanding the relationships between bird species assemblages and habitat characteristics.


INTRODUCTION
Describing the environmental conditions suitable for a species is one approach to assessing its geographic distribution [1], because the distribution of a species is related to environmental conditions. The conditions suitable for the species must be described so that areas matching the species occurrence data can be identified [2]. The environmental data comprise climate and topography data in technical, spatial and temporal forms [3]. Two types of input are required to predict species distribution: the locations where the species occurs and the environmental variables. Both inputs are fed into a process that quantifies and estimates the geographic distribution of a species, known as species distribution modelling (SDM). SDM methods are numerical methods that quantify the correlation between species data and environmental factors [4,5]. Species distribution can be assessed by applying a community ecology approach that studies the species assemblages of a region [6]. Species assemblages are groupings of species that co-occur in the same place at the same time [7]. The benefit of these groupings is that the stability of species assemblages may be used to define representative and reference sites for biological assessment [8]. Assemblages of co-occurring invasive species from source regions are used to identify and highlight new species threats for a related region [9]. In this way, a large number of species can be analysed concurrently and ranked in a particular region based on species correlations.
Data on species and environmental factors consist of a huge number of variables, which makes it difficult to highlight patterns in the dataset [10,11]. These data are complex to analyse because of the large number of species and because the response of organisms to environmental factors is non-linear [12]. Classification techniques are commonly used to simplify the analysis of such complex datasets. Most statistical methods, however, treat the data linearly, and linear analysis does not capture the non-linear relationship between organisms and environmental factors well [13].

RELATED WORKS
SOM is an unsupervised, non-linear method for analysing various data sets, including those with missing values [14,15]. Its assessment of community patterns and correlations may be more accurate than that of common statistical methods. The SOM method has been widely used to identify species community patterns and their relationships with environmental conditions in aquatic habitats. However, it has not been fully exploited for terrestrial habitats or for biodiversity conservation planning [13].
Artificial neural networks (ANN), such as self-organizing maps (SOM), have proved useful and efficient in many climatic, environmental, ecological and species distribution applications [10,12,16,17]. ANN can predict target values based on a learning process [18,19]. SOM analysis is widely used in these domains because of its ability to correlate variables without depending on fixed linearity assumptions. One study clustered ant communities and analysed indicator species using SOM [12]; the results helped in understanding the relationships between ant communities, species and habitat characteristics. SOM has been applied to visualize broad patterns in species richness by taxonomic group [13]. SOM has also been used in the modelling of water quality and water resources. One study applied SOM techniques combined with K-means clustering to model water quality data in the treatment of drinking water [14]. SOM has also been used to describe the state of local water bodies and pollution in an area of interest [20]. Tsai et al. [21] explored the relationship between fish communities and water quality using SOM, which suitably reflected the spatial characteristics of the fishery sampling sites. SOM combined with the K-means algorithm has been used to determine the spatiotemporal pattern of water quality data and to identify the sources of pollution in the Klang River basin [22].
In this study, we describe how SOM can be applied to the study of ecological communities and the distribution of bird species. The adaptation of SOM to biodiversity data is illustrated by training a SOM on example data from Kenyir landscape in Terengganu, Malaysia. Selecting the most ecologically representative species makes it possible to reduce the size of large datasets based on the preferred land use type of each species. A self-organizing map (SOM), which is an unsupervised artificial neural network, can be used to generate values that indicate the strength of association of a species with a species assemblage.
SOM is an artificial neural network (ANN) method that learns to cluster groups of similar input patterns from a high-dimensional input space onto a low-dimensional output layer in a non-linear fashion [15]. It estimates the probability density function of the input data using an unsupervised learning algorithm. The method also enables clustering, visualization and abstraction of complex data. The main objective of data visualisation is to gain insight into information by mapping data into an understandable graphical form [23]. Figure 1 shows the structure of a SOM, which consists of two layers: an input layer and an output layer. The input layer contains one neuron for each variable in the data set. The output layer neurons are connected to every neuron in the input layer through adjustable weights.

PROCEDURE OF SOM APPLICATION
The learning steps are described in detail to give insight into the SOM method applied in the case study. Applying SOM involves the following main steps:

Data gathering and normalization
The sampling methodology used for data gathering is described in the Sampling Methods section. After data gathering, the data are normalized to ensure that all variables have equal importance in the development of the SOM. With normalized data, the SOM may visualize the overall species distribution and assemblage more clearly. Data normalization causes the SOM neurons to spread toward the highlighted parts of the data distribution.
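The normalization step can be illustrated with a minimal sketch. The text does not state which normalization was used, so min-max scaling of each variable to [0, 1] is an assumption here, and the function name `min_max_normalize` and the count matrix are hypothetical.

```python
def min_max_normalize(matrix):
    """Scale each column (variable) of a samples-by-variables matrix to [0, 1].

    Assumption: min-max scaling; the study may have used another scheme.
    """
    n_cols = len(matrix[0])
    col_min = [min(row[j] for row in matrix) for j in range(n_cols)]
    col_max = [max(row[j] for row in matrix) for j in range(n_cols)]
    normalized = []
    for row in matrix:
        new_row = []
        for j, value in enumerate(row):
            span = col_max[j] - col_min[j]
            # A constant column carries no information; map it to 0.0.
            new_row.append((value - col_min[j]) / span if span > 0 else 0.0)
        normalized.append(new_row)
    return normalized


# Hypothetical counts: 3 sites (rows) x 3 species (columns).
counts = [
    [0, 12, 5],
    [4, 3, 5],
    [8, 0, 5],
]
scaled = min_max_normalize(counts)
```

After scaling, an abundant species and a rare one span the same range, so neither dominates the distance calculations used during training.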

Data training
An input vector from the data matrix is presented to the iterative training procedure that forms the SOM. The SOM uses a type of learning called competitive, unsupervised or self-organizing learning to match each input vector with a neuron in the SOM [16]. The neuron that most closely matches the presented input pattern is called the winner neuron or best matching unit (BMU). The winner and its neighbours, as predefined in the algorithm, update their weight vectors according to the SOM learning rule:

$$w_{ij}(t+1) = w_{ij}(t) + \alpha(t)\, h_{jc}(t)\, \left[ x_i(t) - w_{ij}(t) \right]$$

where $w_{ij}(t)$ is the weight between node $i$ in the input layer and node $j$ in the output layer at iteration time $t$, $x_i(t)$ is the $i$-th component of the input vector presented at time $t$, $\alpha(t)$ is a learning rate factor that is a decreasing function of the iteration time $t$, and $h_{jc}(t)$ is a neighborhood function (a smoothing kernel defined over the lattice points) that defines the size of the neighbourhood of the winning node $c$ to be updated during the learning process. At the final stage, the reference vectors of all the activated neurons are updated.
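The training procedure described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the MATLAB routine used in the study. It assumes the standard Kohonen update, in which each weight moves toward the input by α(t)·h_jc(t) times their difference, together with a rectangular lattice, a Gaussian neighbourhood, and linearly decaying learning rate and radius. The names `find_bmu` and `train_som` and all parameter values are hypothetical.

```python
import math
import random


def find_bmu(weights, x):
    """Return the index of the unit whose weight vector is closest to x."""
    best, best_dist = 0, float("inf")
    for j, w in enumerate(weights):
        dist = sum((xi - wi) ** 2 for xi, wi in zip(x, w))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def train_som(data, rows, cols, n_iter=500, alpha0=0.5, seed=0):
    """Train a small SOM; returns (weights, lattice positions)."""
    rng = random.Random(seed)
    dim = len(data[0])
    # Lattice position r_j of each output unit (rectangular grid assumed).
    positions = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [[rng.random() for _ in range(dim)] for _ in positions]
    sigma0 = max(rows, cols) / 2.0
    for t in range(n_iter):
        frac = t / n_iter
        alpha = alpha0 * (1.0 - frac)            # decreasing learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)  # shrinking neighbourhood
        x = data[rng.randrange(len(data))]       # present one input vector
        c = find_bmu(weights, x)                 # winning node c
        for j, w in enumerate(weights):
            d2 = ((positions[j][0] - positions[c][0]) ** 2
                  + (positions[j][1] - positions[c][1]) ** 2)
            h = math.exp(-d2 / (2.0 * sigma ** 2))  # Gaussian h_jc(t)
            for i in range(dim):
                w[i] += alpha * h * (x[i] - w[i])   # move toward the input
    return weights, positions


# Hypothetical normalized data: two groups of samples, two variables.
data = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
weights, positions = train_som(data, rows=3, cols=3)
```

Because every update moves a weight part of the way toward an input value, the weights stay within the range spanned by the initial weights and the data.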

Extracting information from the trained SOM
Once the SOM has been trained, the result can be visualized and the data can be clustered. The trained SOM map is a visualization tool that supports exploration of the data. The first-level SOM groups the input data into a certain number of clusters (the output neurons), and the second level groups the output neurons into as many different regions as required.
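The two-level idea can be sketched as follows: the trained SOM units act as first-level prototypes, and a second clustering step groups those prototypes into regions. The text does not name the second-level algorithm, so plain k-means with a deterministic farthest-point initialisation is an assumption here, and the prototype vectors are hypothetical.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(vectors, k, n_iter=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next centre: the vector farthest from its nearest existing centre.
        far = max(vectors,
                  key=lambda v: min(squared_distance(v, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assign each vector to its nearest centre.
        labels = [min(range(k), key=lambda c: squared_distance(v, centers[c]))
                  for v in vectors]
        # Recompute each centre as the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical trained SOM prototypes (one weight vector per output unit).
prototypes = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
              [0.9, 1.0], [1.0, 0.9], [0.9, 0.9]]
unit_labels, _ = kmeans(prototypes, k=2)
```

Each sample then inherits the region label of its best matching unit, so the number of regions can be chosen independently of the map size.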

SAMPLING METHODS
The SOM analysis was applied to a dataset of birds recorded in Kenyir landscape, Terengganu, Malaysia, from 2015 to 2016. A total of 53 bird species from 20 families were found at the study site. Two sampling methods were used to collect the data: point count and mist netting.

Point count
Ralph et al. [24] stated that point count is one of the most commonly used methods for studying the abundance, distribution and ecology of forest birds. Point count was chosen because it enables observers to locate and observe rainforest birds by standing at a fixed location for a fixed time, which also helps in identifying birds that are hard to observe. All observed birds were identified to species level, and additional information such as distribution status, protection status and conservation status in the IUCN Red List 2014 was taken from published materials. Bird identification was aided by Robson [25], Strange and Jeyarajasingam [26], Davison and Fook [27] and Wong [28], with additional information from Wells [29].

Mist netting
Mist netting is another method that allows more specific identification, including of cryptic species, as well as sampling of genetic material and collection of parasite and morphological data from captured birds [30]. Mist netting is also vital for reconfirming the species observed during the point count survey, and it helps in capturing individuals of several species [31]. In addition, this method focuses more on species distribution than on abundance [32]. Net data, such as net height and GPS reading, were recorded. Birds were identified to species level [33], and external morphological data of captured birds were recorded, including brood patch, moulting stage and the net in which they were captured. The birds were marked before being released at the capture sites. Captured birds were released close to the spot where they were captured to avoid disturbing their daily activities [34].

STUDY SITES
The study sites were chosen for their different forest types in order to determine the relationship between bird diversity and specific habitat types. The study was carried out at 11 sites in Kenyir landscape: Tanjung Mentong, Buweh River, Belukar Bukit 1, Sekayu Waterfall, Kenyir Research Station, Belukar Bukit 2, Sekayu Agriculture Park, Pusat Latihan Rela Wilayah Timur 1, Pusat Latihan Rela Wilayah Timur 2, Saok Waterfall and Belukar Bukit 3. Table 1 lists the codes of the study sites used in the SOM. The study sites were classified into six land use types: forest vegetation, lowland dipterocarp forest, Melaleuca forest, orchard, peat swamp forest and plantation. Each land use type offers birds a variety of different habitats.

DATA ANALYSIS
We used the SOM algorithm to characterise the distribution patterns of birds. The structure of a SOM consists of an input layer and an output layer. The input layer is composed of 53 neurons (one per bird species) connected to the 4 sampling datasets. Training of the SOM and the clustering procedures were performed using MATLAB.

STRUCTURING INDEX
The structuring index (SI) was originally developed to identify the species with the strongest influence on the organization of the SOM map. The SI indicates the relative importance of each species in determining the distribution patterns of the samples in the SOM. Therefore, the set of species with high SI can be considered indicator species.
The SI is calculated as the sum, over pairs of SOM units, of the ratio of the difference between a species' weights in the two units to the topological distance between the units. It therefore represents the distribution gradient of each species in the trained SOM. The structuring index of species $i$, $SI_i$, is expressed as follows:

$$SI_i = \sum_{j=1}^{S} \sum_{k=1}^{j-1} \frac{\lvert w_{ij} - w_{ik} \rvert}{\lVert r_j - r_k \rVert}$$

where $w_{ij}$ and $w_{ik}$ are the connection weights of species $i$ (in the input layer) in SOM units $j$ and $k$, $\lVert r_j - r_k \rVert$ is the topological distance between units $j$ and $k$, and $S$ is the total number of SOM output units. SI considers the distribution gradient of each species in the SOM map. Species showing a strong gradient have a high SI value, whereas species showing a weak gradient have a low SI value. Thus, the higher the value of SI, the more relevant the variable is to the structure of the map.
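The structuring index described above, the sum over pairs of SOM units of the weight difference divided by the topological distance, can be sketched as follows. The use of the absolute weight difference and of Euclidean distance between lattice positions are assumptions, and the function name `structuring_index` and the toy 2-by-2 map are hypothetical.

```python
import math


def structuring_index(species_weights, positions):
    """Sum over unit pairs (j, k) of |w_ij - w_ik| / ||r_j - r_k||.

    species_weights: weight of one species in each SOM unit.
    positions: lattice coordinates r_j of each SOM unit.
    """
    total = 0.0
    for j in range(len(species_weights)):
        for k in range(j):  # each unordered pair is counted once
            topo = math.dist(positions[j], positions[k])
            total += abs(species_weights[j] - species_weights[k]) / topo
    return total


# Hypothetical 2-by-2 map.
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
si_gradient = structuring_index([0.0, 0.0, 1.0, 1.0], positions)  # strong gradient
si_flat = structuring_index([0.5, 0.5, 0.5, 0.5], positions)      # no gradient
```

A species whose weight changes sharply across the map accumulates large ratios and ranks high, while a species with the same weight everywhere scores zero.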

SOM MAP GRID
The SOM map grid is specified by its numbers of rows and columns, both of which we set to 10. Figure 2 shows the SOM map grid of output neurons. The default topology of the SOM is hexagonal. The figure shows the neuron locations in the topology and indicates how many of the training data are associated with each neuron (cluster center). The topology is a 10-by-10 grid, so there are 100 neurons.
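A hexagonal 10-by-10 grid and the per-neuron sample counts can be sketched as follows. The coordinate convention (odd rows shifted by half a unit, rows spaced √3/2 apart) is one common way to lay out a hexagonal lattice and is an assumption here, as are the function names and the BMU indices.

```python
import math
from collections import Counter


def hex_grid_positions(rows, cols):
    """Planar coordinates of units on a hexagonal lattice with unit spacing."""
    positions = []
    for r in range(rows):
        for c in range(cols):
            x = c + 0.5 * (r % 2)       # odd rows shifted by half a unit
            y = r * math.sqrt(3) / 2.0  # row spacing of a regular hex grid
            positions.append((x, y))
    return positions


def count_hits(bmu_indices, n_units):
    """Number of samples whose best matching unit is each neuron."""
    counts = Counter(bmu_indices)
    return [counts.get(j, 0) for j in range(n_units)]


positions = hex_grid_positions(10, 10)
# Hypothetical BMU index for each of 11 samples (one per study site).
bmus = [0, 0, 5, 17, 17, 17, 42, 63, 63, 99, 99]
hits = count_hits(bmus, len(positions))
```

With this layout, every neuron in the interior of the map has six neighbours at distance 1.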

CLUSTERING LAND TYPE USE
There are 53 elements in each input vector, one per bird species. The weight vectors, which are the cluster centers, fall within the output vector space of the 100 neurons. An input vector is a vector of variables and attributes, where the attributes are represented by the 11 sampling sites. Each input vector is compared with every unit. The SOM map grid organizes the analysis using a defined set of output vectors, and the input vectors are sorted on the grid so that similar input vectors are close to each other and dissimilar input vectors are far apart. One visualization tool for the SOM is the weight distance matrix shown in Figure 3.
In Figure 3, the blue hexagons represent the neurons and the red lines connect neighbouring neurons. The colours in the regions containing the red lines show the distances between neurons: darker colours represent larger distances and lighter colours represent smaller distances. The SOM network appears to have clustered the land use types into four distinct groups, as shown in Figure 4. These clusters reflect the land use type at each sampling site: cluster 1 corresponds to forest vegetation, cluster 2 to peat swamp forest, cluster 3 to lowland dipterocarp forest and Melaleuca forest, and cluster 4 to orchard and plantation. The trained SOM map, consisting of nodes related to their output vectors, is presented as two-dimensional (2D) component planes. The weight vector associated with each neuron moves to become a cluster center of the input vectors. The distribution of each variable is represented on the trained map by sections distinguished by different colours. These component planes can be compared in order to identify similarities between variables: if the patterns of two inputs are very similar, the inputs can be assumed to be highly correlated. The component planes representing probability of occurrence are shown in Figure 5. malaccense, Stachyris poliocephala, Prionochilus percussus and Aegithina tiphia were particularly common in cluster 2, which corresponds to peat swamp forest, and in clusters 3 and 4. Geopelia striata, Philentoma pyrhoptera and Erythrura prasina were found in cluster 1, which corresponds to forest vegetation. The availability of habitat, such as the characteristics of the vegetation and the food sources, has a large impact on bird distribution, and birds depend solely on their habitat for survival [35,36].
Therefore, the analysis identifies the habitats whose vegetation provides food sources, which is a crucial indicator for conserving bird species that have a significant ecological role. The presence of birds in a particular area is also strongly influenced by other characteristics such as suitable habitat, shelter, and reduced competition and predation [33]. Ramli et al. [33] also stated that bird population size in a specific area is mainly affected by food supply, seasonal breeding, migratory activities and habitat changes. Habitat selection varies from species to species, and birds may respond differently to vegetation structure [37,38].
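The two visual tools discussed in this section, the distances between neighbouring neurons and the comparison of component planes, can be sketched as follows. A rectangular grid is assumed for simplicity (the study's map is hexagonal), Pearson correlation is used as one possible measure of similarity between planes, and the weights and function names are hypothetical.

```python
import math


def neighbour_distances(weights, rows, cols):
    """Euclidean distance between weight vectors of adjacent units.

    Units are stored row by row on a rectangular grid (an assumption).
    """
    distances = {}
    for r in range(rows):
        for c in range(cols):
            j = r * cols + c
            if c + 1 < cols:  # right-hand neighbour
                distances[(j, j + 1)] = math.dist(weights[j], weights[j + 1])
            if r + 1 < rows:  # neighbour in the next row
                distances[(j, j + cols)] = math.dist(weights[j], weights[j + cols])
    return distances


def component_plane(weights, variable):
    """Weight of one variable (species) in every unit of the map."""
    return [w[variable] for w in weights]


def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical 2-by-2 map with three species (variables) per unit.
weights = [[0.1, 0.05, 0.9], [0.2, 0.1, 0.8],
           [0.8, 0.4, 0.2], [0.9, 0.45, 0.1]]
dists = neighbour_distances(weights, rows=2, cols=2)
r_01 = pearson(component_plane(weights, 0), component_plane(weights, 1))
r_02 = pearson(component_plane(weights, 0), component_plane(weights, 2))
```

Large neighbour distances mark boundaries between clusters, while a correlation near 1 between two component planes suggests that the two species occupy the same parts of the map.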

CONCLUSION
The SOM was found to be an effective analytical tool for understanding the responses of bird species to different habitats and for assessing the relationships among species assemblages in each land use type. Patterns and relationships in ecological communities can be represented visually, and unexpected structures may be revealed by the analysis. However, SOM has a limitation in classifying data because every trained SOM is different [7]: it finds slightly different similarities among the sample vectors each time the initial conditions change. To handle this limitation, the change in rank of each species must be observed each time the data are resampled and a new SOM model is created, so that valid SOM ranks can be obtained. Despite this limitation, SOM and other machine learning methods may be applied efficiently to complex and non-linear data. They may provide more accurate assessments of species patterns and ecological processes than traditional statistical methods. Applying SOM methods to the relationship between species distribution and ecological assessment could contribute to developing strategies for conserving biodiversity.
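Tracking the change in rank of each species between two SOM runs can be sketched as follows. The scores stand for any per-species importance value, such as the structuring index; the species names, values and function names are hypothetical, and ties are not handled.

```python
def ranks(scores):
    """Rank 1 = highest score; ties are not handled in this sketch."""
    order = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(order)}


def rank_changes(run_a, run_b):
    """Rank in run_b minus rank in run_a for every species."""
    ranks_a, ranks_b = ranks(run_a), ranks(run_b)
    return {name: ranks_b[name] - ranks_a[name] for name in ranks_a}


# Hypothetical importance values from two SOM runs on resampled data.
run_1 = {"species_a": 3.4, "species_b": 1.2, "species_c": 0.3}
run_2 = {"species_a": 3.1, "species_b": 0.2, "species_c": 0.9}
changes = rank_changes(run_1, run_2)
```

Species whose rank barely moves across runs are the more reliable indicators, whereas large shifts point to rankings that depend on the initial conditions.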