Development of a Spatial Path-Analysis Method for Spatial Data Analysis

Wiwin Sulistyo1,3, Subanar2, and Reza Pulungan1 Dept. of Computer Science and Electronics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Indonesia Dept. of Mathematics, Faculty of Mathematics and Natural Sciences, Universitas Gadjah Mada, Indonesia Faculty of Information Technology, Universitas Kristen Satya Wacana, Indonesia Emails: wiwinsulistyo@staff.uksw.edu, subanar@ugm.ac.id, pulungan@ugm.ac.id


INTRODUCTION
Path analysis is a problem-solving method with statistical approach developed by Sewal Wright in 1921 [1,2]. It can be used to parse and explain relationship or effects between independent (exogenous) variables and a dependent (endogenous) variable involved in a particular problem. Path analysis is able to deliver results in the form of a diagram that represents the relationship between the variables, either by direct or by indirect correlation, so that the total effect can be calculated with statistical approach in the form of a causal or structural model [1,3,4,2]. The usual approaches for path analysis are correlation and regression [5,6]. Some studies have made use of path analysis, among others: the identification of the effect of motivation on math score achievement in reformation-based curriculum [7], the measurement of the effect of e-government utilization on the profits of business ventures in Dubai [8], and the identification of the effects of job stress (time stress and anxiety), job satisfaction, and work motivation on Occupational Health and Safety Administration (OSHA) in Texas [9].
It is interesting to study how path analysis can be used to calculate or process spatial data. Spatial data are a type of data taken in a specific location; thus, they may contain attributes and location information, and hence a kind of dependent data [10]. Spatial analysis is a quantitative analysis that involves spatial data, which include a pattern of dots, lines, areas, etc. in the form of maps and coordinates in two or three-dimensional spaces [11]. The spatial effects in data attributes that geographically have relevance must be identified based on the spatial pattern description and relationships [11,12]. The spatial effects cause the appearance of spatial dependence. Spatial dependency becomes a very important issue in determining the accuracy of the correlation analysis and the regression model built. To produce a proper correlation analysis and regression model we must consider the effects of spatial dependencies in cases that have a spatial tendency, because the regression model will be considered invalid if we ignore the existing Journal Homepage: http://iaescore.com/journals/index.php/IJECE spatial effects. Path analysis, therefore, needs to consider the spatial effects on variables that are calculated, so that the resulting model has a better accuracy. Two statistical approaches commonly used in spatial analysis are spatial autocorrelation and spatial regression. Given a set S containing n geographical units, spatial autocorrelation refers to the relationship between some variables observed in each of the n localities and a measure of geographical proximity defined for all n(n − 1) pairs chosen from n [13]. Spatial autocorrelation has been used to analyze the pattern of migratory attack of brown plant hoppers (nilaparvata lugens) in Boyolali, Klaten, Sragen, and Karanganyar districts in Indonesia [14]. Additionally, spatial autocorrelation has also been used to determine the risk factors for outbreaks of Highly Pathogenic Avian Influenza (HPAI) in Bangladesh [15]. Spatial regression, on the other hand, is widely used to identify and examine the magnitude of spatial effects between independent variables and a dependent variable [16,17]. Spatial regression is a method for analyzing the relationship between an outcome of dependent variable Y and one or more independent variables X and allows for spatial dependence in their observations [18]. Spatial regression can be identified by detecting the condition of spatial dependency based on the error (residual) from the regression model that is initially formed [19].
Spatial autocorrelation and spatial regression have been used to identify patterns of relationship and levels of influence between several regions for some particular variables. Spatial autocorrelation, however, is only capable to identify patterns of relationship that occur, while spatial regression produces values of coefficients (β) that describe the magnitude of the direct influence of independent variables on the dependent variable. Several things cannot be done with this approach, namely how to identify the indirect effects of the independent variables on the dependent variable to produce the total effect, and how to build a model that is able to combine the results of spatial autocorrelation and spatial regression. Spatial path analysis is needed to identify the direct and indirect effects so that the total effects between variables, based on the calculation of spatial autocorrelation and spatial regression, that make up the causal (structural) model can be studied.
This study develops a method derived from path analysis to unravel the relationship and effects between variables to identify the spatial effects that occur. Previously this cannot be accomplished by the standard path analysis method. We call this new method spatial path analysis. To demonstrate the feasibility of the proposed method, we apply it to analyze the spatial effects in determining rice productivity and harvest area based on several factors, namely area of the region, rainfall, water spring, rice field area, and population. The end result of this study is expected to produce a spatial model that is able to identify and analyze the effects of each independent variable towards the dependent variable by involving the spatial elements therein.

PATH ANALYSIS
Path analysis uses correlation and regression analyses to explain the relationship between the observed variables. Many problems in computer science have made use of statistical approaches based on correlation and regression [20,21]. This analysis computes direct and indirect effects between independent and dependent variables to produce a total effect in the form of structural equations. The stages of path analysis are depicted in Figure 1.

Path model
Path analysis begins with determining a path model, which describes relationship between independent and dependent variables. A path model determines a causal model between the variables based on a hypothesis that represents the causals that have been formulated in the path analysis. The causal model is constructed based on a theory or the results of previous research concerning relationship of the variables used in the study. Figure 2 shows several examples of path models. Figure 2(A) shows the relationship among variables X1, X2 and Y, where X1 directly affects X2 and X2 directly affects Y, but X1 only indirectly affects Y. Figure 2(B) shows that both X1 and X2 directly affect Y, but X1 and X2 do not affect each other. Figure 2(C) illustrates that X1 directly affects Y and X1 also indirectly affects Y through X2. Figure 2(D) shows that X1 and X2 directly affect Y and X1 and X2 correlate.

Path coefficients
Path coefficients are values obtained from carrying out correlation and regression analyses on the independent and dependent variables in the path model [6,22]. As shown in Figure 2, every variable has relationship which is depicted by arrows to one or more variables. Figure 2 Figure 2. Examples of path models [1] regression calculation are called path coefficients. Therefore, every relationship of several variables in the path model will have a value called a path coefficient. In this research, some of the path coefficients are obtained by the Pearson correlation equation. Regression analysis, on the other hand, is used to measure the degree of the influence of independent variables toward a dependent variable. One of the methods used to conduct regression analysis is ordinary least squares (OLS) [16,23,24,25]. In this research, some of the path coefficients are obtained by using ordinary least squares.
Furthermore, a significance test is performed to measure the level of significance of each path coefficient value to ensure that the path coefficient formed is significant. This significance testing uses partial test (e.g., t-test) and simultaneous test (e.g., F-test). Beside t-test and F-test, a testing for goodness-of-fit is also performed, which uses coefficient of determination (R 2 ). Based on the resulting significance level of the path coefficients, a trimming process is carried out to produce a path diagram that consists only of paths that have significant values and to cut away all paths that are not significant.

Structural equations
The final process of the path analysis is the resulting identification of the effects of independent variables on the dependent variable. The relationship between the variables is described in the form of structural equations. The structural equations are constructed by calculating the direct effects (DE), indirect effects (IE), and the total effect (TE) between the variables [1]. The values of DE, IE, and TE are determined based on the path coefficients.

SPATIAL ANALYSIS
Spatial analysis is applied to obtain a more accurate equation model by fixing the errors of data observation. Spatial analysis considers the spatial effects during data observation. Spatial analysis is used to analyze objects (variables) of the research associated with geographic oriented data. Spatial data contains information of a location (coordinates) and the attributes that exist in it. Spatial autocorrelation and spatial regression are approaches that can be used to analyze spatial connectivity.

Neighborhood analysis
Neighborhood analysis is the process of identifying two or more regions of spatial interaction that intersect each other. Identification of spatial relationships is expressed in the spatial-weights matrix W. There are several methods that can be used for this analysis, among them distance method and contiguity method. For distance method, several approaches can be used to identify relationship between regions expressed as latitude and longitude, namely power function, negative exponential function and general spatial weights. For contiguity method, on the other hand, queen, castle, or bishop contiguity approaches can be used to identify the environmental relationship [26]. The choice of methods depends on the case study and in this paper, we choose queen contiguity. An area that intersects another area will have a value of 1, while an area that does not intersect will have a value of 0 [27]. The set of intersection values between these areas forms a matrix called spatial-weights matrix W. In practice, some methods standardize this spatial-weights matrix to obtain W s . The result of neighborhood analysis becomes the basis for the calculation of spatial autocorrelation and spatial regression.

Spatial autocorrelation
Spatial autocorrelation is an analysis of the correlation between the observed regions (spaces) in the form of spatial patterns (patterns of distance, time and area) [28]. On the determination of spatial units (neighborhood analysis), spatial autocorrelation will identify the pattern of spatial relationship based on a certain variable. There are several statistical approaches commonly used to measure the spatial autocorrelation, such as Moran's I [29], Geary's C [30] and the Getis-Ord statistics [31]. In this study, the approach we use to measure the spatial autocorrelation is Getis-Ord statistics (G * i ). Getis-Ord uses a statistical approach to measure the spatial relationship/correlation by using a matrix based on the intersection of the regions. Getis-Ord statistical method is also used to measure whether the data concentration value is high or low in a specific area [32].

Spatial regression
Spatial regression measures the effect of an independent variable towards the dependent variable by identifying the spatial effect. General spatial model (GSM) is a linear regression that incorporates the elements of spatial effect into the model. In general, GSM can be written as [16,23,25]: where y represents the dependent variable and X represents the independent variables in the form of a matrix containing explanatory variables (n × p), u is the error vector (residual) on the spatial autoregressive model (SAR), W s is a spatial-weights matrix of standardized queen contiguity, ρ is a coefficient of the spatially lagged dependent variables ISSN: 2088-8708 W s y, β is the regression coefficients, which describe the influence of the independent variables (X) to the dependent variable ( y), λ is the error coefficient value of spatially autocorrelated W s , and e is the error vector (residual) in the equation of the spatial autoregressive error model (SER). To carry out this spatial regression, several stages must be completed, namely Moran's test, Lagrange multiplier (LM) and maximum likelihood (ML).
Moran's test One approach that can be used to test spatial dependency is Moran's test (Z I ) [27,16,33]. Moran's test is used to measure the significance level of the result of Moran's I calculation.
Moran's I statistic shows the spatial autocorrelation in residual least squares. The Moran's test uses standardized spatial-weights matrix (W s ) to perform the calculations [27]. This test is conducted based on the residual value ε of OLS. The value of Moran's I statistic will be tested to find out its level of significance. From the result of the Moran's test, the existence of the spatial dependence of the variable can be ascertained.
Lagrange multiplier The next test for the spatial dependency is to determine the appropriate spatial regression equation. This process is performed by using Lagrange multiplier. Lagrange multiplier test consists of spatial error dependency (LM error ) and spatial lag dependency (LM lag ). Lagrange multiplier performs calculations based on the least-squares residual (OLS) involving spatial-weights matrix (W s ). Anselin in [16] formulated the equations to test spatial lag (LM lag ), while Burridge in [34] constructed the equations to test spatial error model (LM error ) [17].
Maximum likelihood In order to calculate the spatial regression we need to determine the values of ρ and λ in the spatial regression equation. In this study, maximum likelihood method is used to obtain these values [16].

PROPOSED APPROACH
Basically, the spatial path analysis we now develop is based firmly on the previous path analysis method. The development of the spatial path analysis emphasizes on the spatial process approach involving statistical methods to identify and analyze the spatial effects on variables. In consequence, changes occur in the calculation of the path coefficients. In spatial path analysis there are additional variables, i.e., maps. A map is a spatial data variable that contains an image and attributes. It is in the form of files consisting of shapefile shape (*.shp), shapefile shape index (*.shx) and shapefile attribute (*.dbf).
The stages of the spatial path analysis are depicted in Figure 3, which are further described as follows: Stage 1 Specifying the path model on the basis of a hypothesis that represents the formulated causals, so that the independent (X) and dependent ( y) variables will be identified in the path model [5].
Stage 2 After the path model is specified, neighborhood analysis is carried out. In this study, the method used to identify the neighborhood is queen contiguity. A spatial-weights matrix is produced to calculate the spatial autocorrelation and identify the spatial dependence. Unstandardized spatial-weights matrix is used to calculate the spatial autocorrelation (Getis-Ord), whereas the spatial-weights matrix is used for calculating spatial dependency (Moran's test).
Stage 3 Performing spatial autocorrelation analysis. At this stage, measurement is performed to determine the strength of the relationship between regions based on the variables used. The method used to analyze the spatial autocorrelation is Getis-Ord statistic. Calculation of the Getis-Ord statistics will generate G * i value for each region against other regions based on the variables analyzed. The correlation of G * i values for each of these variables is then calculated using Pearson correlation. The correlation coefficients produced will be used as path coefficients that represent the values of the correlation between the analyzed variables.

Stage 4 Performing spatial regression analysis. At this stage, the calculation of regression equation is performed based
on the model that has been built to identify causal relationship that occurs. This phase calculates linear regression with ordinary least squares to generate the regression model. Furthermore, identifying the spatial dependency using Moran's test is based on the residual value (ε) of regression model that has been formed. As a result, if there is a spatial dependency, the process of determination of spatial analysis models uses Lagrange multiplier test. If there is no spatial dependency, the coefficients in the regression models that have been produced by OLS will be used as path coefficients on path analysis. Spatial regression models will be determined by the values of ρ and λ generated by calculating LM lag and LM error . The result of calculating LM lag and LM error will determine one of such spatial regression models as SEM , SER, and SGM .

Stage 5
After acquiring spatial regression model, the next stage is determining the values of spatial regression coefficients (ρ and λ) by using maximum likelihood method.  The final result of the spatial path analysis is a structural equation expected to identify the effects of independent variables toward the dependent variable by considering the spatial effects that occur in the model.

PROOF OF CONCEPT
Klaten is a regency in Central Java province in Indonesia. Klaten has one of the highest rice productivities in the province. Its total area is 65,556 hectares divided into 26 subdistricts with 39,692 hectares of agricultural land. Geographically, Klaten exhibits very good potentials in agriculture because it has numerous water sources and rainfall of 1,682 mm (year 2015) and most of its area is low land. It is interesting to investigate whether rice productivity and harvest area in Klaten are affected by area, rice field area, rainfall, water source and population.
In this section, we will demonstrate how to use spatial path analysis to identify direct and indirect effects on the dependent variables, namely rice productivity and harvest area in Klaten, based on independent variables, namely area, rainfall, water source and population. Issues that will be analyzed include: ISSN: 2088-8708 1. Is there any effect of the area, rainfall and water source variables toward rice productivity? 2. Is there any indirect effect of the rainfall variable towards rice productivity through water source variable? 3. Is there any effect of the area, rainfall, water source, and rice field area variables toward the harvest area? 4. Is there any indirect effect of the rainfall variable towards harvest area through water source variable?
Based on the problems that have been proposed, a path model is constructed as shown in Figure 4. In the following, we are going to apply spatial path analysis to identify the inter-variable effects as previously described. This identification process involves the existing spatial effects. Neighborhood analysis produces identification of spatial units based on subdistrict areas as indicated by polyID for each of the 26 subdistricts. Figure 5 shows the map of Klaten and also depicts the result of neighboring identification for each of its subdistricts by using queen contiguity method. This enables us to determine each subdistrict's neighboring subdistricts. The red lines represent that a subdistrict is in a neighborhood with other subdistrict(s). The number on each subdistrict indicates its code (P olyID). The result of this identification is then represented in a spatial-weights matrix for subsequent spatial process. Figure 5. The result of neighborhood identification with queen contiguity approach Further analysis of spatial correlation is then performed by applying Getis-Ord computation, whose result is depicted in Figure 6. Three variables are analyzed, namely population, rainfall and area. The Getis-Ord spatial autocorrelation aims to identify spatial tendency of each area based on those variables. This process identifies areas with hot-spot status (high values) that exhibit spatial tendency and areas with cold-spot status (low values) that do not exhibit spatial tendency. Figure 6 shows color difference to indicate the difference of the areas' status, where red color represents areas with hot-spot status while blue color represents areas with cold-spot status.
It can be observed that variables rainfall and area are visible in areas with hot-spot status, i.e., subdistricts that are neighboring to each other. Areas with dark red color exhibit hot-spot concentration toward their neighboring areas. Meanwhile, areas with hot-spot status tend to disperse for the population variable.
Beside color that represents status, Figure 6 also depicts values produced from Getis-Ord computation (G * i ) for each subdistrict. Based on these results, it can be concluded that there exists spatial tendency (hot-spot) in several subdistricts that spatially intersect; and hence they are visually clustered. These conditions exist for rainfall and area variables. On the other hand, for population variable, areas with spatial tendency (hot-spot) are visually dispersed into three groups. For spatial path-analysis computation, the spatial autocorrelation value (G * i ) for each subdistrict will be correlated to produce inter-variable correlation values. The correlation results generate the path coefficient values of the variables. Table 1 shows the Pearson correlation analysis result between G * i values of the three variables, population, rainfall, and area.  More specifically, the Pearson correlation between the G * i values of the area and rainfall variables is 0.6168 (strong) with statistical significance (p-value) of 0.001 (***), indicating very significant. Similarly, the correlation between area and population variables is 0.8181 (very strong) with statistical significance of 0.001 (***), which indicates very significant. These imply that there are positive relationships between spatial autocorrelation G * i values of area and rainfall variables as well as between those of area and population variables. These further suggest that an increase in area variable will also increase rainfall and population variables, although this does not necessarily indicate causal relationships between area variable (explanatory variable) and rainfall and population variables (dependent variables).
Next, the spatial regression estimation is performed; this aims to find the effect of a certain variable on other variables based on the existing spatial effect. As previously described in the computational stage (Figure 3), we run the spatial dependency test based on the residual values of the OLS regression. We use Moran's test for our spatial dependency test. Based on the Moran's test, we find that, in general, all regression residuals (ε i ) exhibit significant spatial autocorrelation; this is shown in Table 2. This result explains that there exist spatial dependence elements in the regression residuals (ε i ).
Based on the test result shown in Table 2, we can determine the appropriate spatial regression model that exhibits significant result for each estimation. We use Lagrange multiplier test for spatial error dependence (LM err ) and Lagrange multiplier test for spatial lag dependence (LM lag ) to determine our spatial regression model. Based on the significance tests, we find that the appropriate spatial regression model is spatial lag model. Table 3 shows the ISSN: 2088-8708 intercept coefficient (β 1 ), the regression coefficient (β 2 ), and the spatial coefficient (ρ) of our spatial regression model generated by the spatial lag model estimation. In the next step, we develop a path model based on the spatial correlation and the regression estimation, which generates a diagram depicted in Figure 7. As a result, based on the total effect, we further obtain the following structural equation that represents the relationships between rice productivity (Y 1 ) variable and harvest area (Y 2 ) variable with area (X 1 ), rainfall (X 2 ), water source (X 3 ), and population (X 4 ) variables: It then can be concluded that rice productivity is affected by area by 3.1660, by rainfall by 9.5341, and by water source by -25.1690. Meanwhile, harvest area is affected by area by 0.1213, by rainfall by 4.1551, by water source by -5.0441, and by population by 0.0484. In order to show more closely how all independent variables influence the dependent variable (based on the resulting structural equations), we show the values of all variables side by side in Figure 8 and Figure 9. The x-axis of both charts in the figures represents subdistricts, while the y-axis represents the values of all independent variables (X 1 , X 2 , X 3 , X 4 ) and dependent variable Y 1 (for Figure 8) and dependent variable Y 2 (for Figure 9).
Based on the results of the first structural equation shown in Figure 8, rainfall (X 2 ) exhibits the greatest contribution that affects rice productivity (Y 1 ), implying that higher rainfall increases rice productivity. The second greatest contribution that affects rice productivity comes from area (X 1 ). As can be seen from the graphical pattern in the figure, area also supports rice productivity when rainfall drops. For example, this can be indicated at subdistricts 4 and 23. On the contrary, conditions at subdistricts 7, 8, and 25 indicate the cases when low value of area is supported by higher rainfall. Next, although water source (X 3 ) has greater coefficient value (−25.1690), the fact that it has small value in each area in Klaten makes its contribution to rice productivity less significant. Overall, these results indicate that the most dominant factor in determining rice productivity in Klaten is rainfall, with area also supporting the rainfall's contribution.  Figure 9. Chart showing how area, rainfall, water source, and population influence harvest area Figure 9 depicts the results of the second structural equation, namely harvest area (Y 2 ). Harvest area is heavily affected by rainfall (X 2 ), with population exhibiting the second largest effect. It can be seen that large population of subdistricts 6, 9, 13, and 22 compensates relatively low rainfall condition to sustain harvest area. Meanwhile, areas with relatively low population are supported by high rainfall level in sustaining their harvest area, such as subdistricts 8 and 24. However, area (X 1 ) and water source (X 3 ) exhibit small effects on harvest area. This indicates that subdistricts' areas are not necessarily associated with appropriate environment for rice fields. Similarly, relatively small water source conditions in each area causes water source to have small effect on harvest area. Overall, these ISSN: 2088-8708 results suggest that the most dominant variable in determining harvest area in Klaten is rainfall, with population also supporting the rainfall's contribution.
Path analysis without spatial analysis For comparison, we also perform path analysis without considering spatial influences, namely by carrying out the method depicted in Figure 1. Figure 10 shows the resulting path diagram and path coefficients produced by this path analysis. The difference between the results of path analysis and spatial path analysis (as shown in Figure 7) is readily observable. This difference stems mainly from the changes to the variables with the greatest contribution toward rice productivity and harvest area. Other experiments we have conducted also indicate that spatial path analysis produces models that are more consistent than path analysis. This means that spatial dependence is an important factor in path analysis. Proper identification of spatial dependence on the measurement of data results in a more accurate model.

CONCLUSION
Spatial path analysis is capable of identifying possible spatial effects on variables of interest. The ability of path analysis in disentangling inter-variable relations, both direct and indirect effects, allows one to identify the degree of contribution of each variable toward other variables. Including spatial analysis into path analysis produces structural equations that take existing spatial effects into account. As a consequence, spatial path analysis is supposed to generate better results for cases that involve variables that exhibit spatial tendency.