A non-negative matrix factorization based clustering to identify potential tuna fishing zones

Many clustering methods based on nonnegative matrix factorization (NMF) are used to discover patterns and knowledge in data. Because our data set of daily tuna catches is sparse, we adopted a clustering approach based on NMF. By adding a sparseness constraint and assigning a good initial value to the modified NMF method, the proposed algorithm, Direct-NMFSC, yielded better clusters than SNMF and NMFSC, two other methods that also use sparseness constraints. The results show that Direct-NMFSC needs on average 5.376 times fewer iterations than NMFSC and reaches a CH index of 531.97. Determining potential fishing zones is an essential part of a potential fishing zone mapping system for tuna fishing. This data-driven study constructs that information and identifies the potential tuna fishing zones. We also show that Direct-NMFSC can identify the potential tuna fishing zones, represented by the red cluster, which carries both spatial and temporal information.


INTRODUCTION
Nonnegative matrix factorization (NMF) was initially used as a technique for low-rank matrix approximation and has since been applied in many areas, including clustering [1]. NMF was first introduced by Lee and Seung [2] as a method for decomposing data into a low-rank factorization, and subsequent studies on partitioning problems developed many variants of NMF clustering [3]. Some NMF-based clustering methods are used to discover the geometric properties of data [4], and Liu et al. [5] proposed a novel semi-supervised nonnegative matrix factorization to detect popular communities. NMF-based clustering has also been applied successfully in image processing and computer vision [6], [7]. Inspired by these successes, in this paper the authors propose to apply NMF-based clustering and adjust it to the characteristics of the data. In this study, the authors use daily tuna catch data. Such data are commonly used in studies on identifying potential fishing zones (PFZ) for tuna. These studies involve many disciplines, such as marine technology, fisheries, remote sensing, and computer science. Computer science-based approaches to determining the PFZ include [8]-[11]; each uses the methods of its discipline to map and predict potential fishing zones.
In machine learning approaches, predicting potential fishing zones involves many activities, the most fundamental of which is constructing the ground truth data. In many studies the ground truth data are constructed with classification methods, while others use unsupervised learning to create the potential fishing zones [12], which yielded better results in determining the ground truth PFZ. Previous research on ground truth data construction used a grid density-based clustering algorithm [8]. That framework yielded good results, yet the accuracy of the prediction was still not satisfactory.
Fish catch data are generally sparse, which means that only a few units are effectively used to represent the potential fish catch. In other words, most units take zero values while only a few take significantly non-zero values. As a side effect, NMF can produce a sparse representation of the data [13], but the degree of sparseness cannot be controlled. Hoyer [14] added the option to control the sparseness of NMF. The aim of that research was to constrain NMF to find the best solutions with a desired degree of sparseness by tuning the sparseness constraint parameter [13]. Motivated by this, the present research uses a different clustering approach based on NMF, which derives the clusters directly from the matrix factorization, to construct the information on potential fishing zones.
This approach accommodates the sparseness of the fish catch data. To reduce the running time, a good value can be chosen as the initial parameter. In this paper, we use a sparseness parameter pre-computed from the distribution of the data as the initial parameter, which shortens the time to converge. The initial sparseness parameter is computed using the sparseness equation explained in the next section.

NONNEGATIVE MATRIX FACTORIZATION AND RELATED STUDIES
NMF has been applied in many case studies. Shahnaz et al. [15] used NMF for document clustering. NMF can organize a collection of text documents into groups directly from the factors. In the analysis of text data, these factors take the form of nonnegative vectors that represent the semantic features of the documents, such as a collection of words that indicates a particular topic. Pauca et al. [16] used an effective NMF algorithm with novel smoothness constraints to unmix spectral reflectance data for space object identification. Sun and Sang [17] proposed a spatial-temporal clustering algorithm (STClu) based on nonnegative matrix factorization for grouping data that flow continuously as time series. STClu combines the data of two adjacent sensors and integrates spatial and temporal information into the clustering. The algorithm was tested on synthetic data and on real traffic flow data from US highways. In image processing and computer vision, NMF is applied to discover image representations; Li et al. proposed a novel semi-supervised learning method based on NMF that explicitly explores the block-diagonal structure of NMF [18]. In another study of NMF-based clustering in image processing, Woo et al. identified the functional units of the tongue based on the mechanisms of normal and abnormal muscle coordination patterns. Using magnetic resonance images, this study yielded excellent results that support improved treatments for patients [19]. Deepthi et al. [20] modified NMF to perform image compression for multimedia data. Dai et al. [21] proposed weighted nonnegative matrix factorization (WNMF) as an improvement of NMF to recover noisy data, using a weighted graph that labels fine data as 1 and bad data as 0.
Several studies have proposed extensions and modifications of the standard NMF model. Hoyer [14] extended the NMF framework to include an adjustable sparseness parameter, a method called nonnegative sparse coding. Liu et al. [22] analyzed the need to incorporate sparseness and proposed sparse nonnegative matrix factorization (SNMF). Their extension is similar in idea and form to that of Hoyer [14], with the advantage of a faster algorithm. Kim and Park [23] proposed using SNMF as a data clustering method: with a sparsity constraint on the coefficient matrix factor in the NMF objective function, NMF can be used to perform clustering. However, nonnegative sparse coding and SNMF both suffer from the drawback that sparseness is controlled only implicitly. Hoyer [13] proposed NMF with sparseness constraints (NMFSC), which extends NMF with the option to control sparseness directly.
Matrix factorization is the process of decomposing a matrix into several matrices. The purpose is to find matrices that give a useful representation and reduce the dimensionality. Factorization methods are generally divided into two groups: direct methods and successive approximation methods. Nonnegative matrix factorization is a frequently used approximation method. It decomposes a matrix $A_{m \times n}$ into two matrices, a basis matrix $W_{m \times k}$ and a coefficient matrix $H_{k \times n}$, where the rank $k$ is chosen relative to $\operatorname{rank}(A)$ and $\min(m, n)$, with $k \le \min(m, n)$, and all elements of W and H are nonnegative. The decomposition is shown in (1).
\[ A_{m \times n} \approx W_{m \times k} H_{k \times n} \tag{1} \]
Each column of A represents one object and is approximated by a linear combination of the k basis vectors in the columns of W. The conventional approach to obtaining W and H is to minimize the difference between A and WH, as shown in (2),
\[ \min_{W, H} \|A - WH\|_F^2 \quad \text{subject to } W, H \ge 0 \tag{2} \]
where $\|\cdot\|_F$ is the Frobenius norm, as shown in (3).
\[ \|A - WH\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( A_{ij} - (WH)_{ij} \right)^2 \tag{3} \]
The most popular approach is the multiplicative algorithm [23]. It is easy to implement and often gives the best results. The pseudocode of the NMF algorithm is given in Figure 1. Because the tuna catch data are sparse in the spatial coordinates, sparseness constraints should be added when grouping the fishing areas.
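The multiplicative algorithm mentioned above can be sketched in Python with NumPy. This is a minimal illustration of the Lee and Seung update rules for the Frobenius-norm objective, not the MATLAB code used in the experiments; the function name `nmf_multiplicative`, the random initialization, the fixed iteration count, and the small constant `eps` are assumptions made for the example.

```python
import numpy as np

def nmf_multiplicative(A, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix A (m x n) by W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    errors = []
    for _ in range(n_iter):
        # Each update multiplies by a ratio of nonnegative quantities,
        # so W and H stay nonnegative throughout.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
        # Track the squared Frobenius error of the current factorization.
        errors.append(np.linalg.norm(A - W @ H, "fro") ** 2)
    return W, H, errors

# Toy example: a random nonnegative 6 x 5 matrix factorized with k = 2.
A = np.random.default_rng(1).random((6, 5))
W, H, errors = nmf_multiplicative(A, k=2)
```

The `eps` term only guards against division by zero; it is not part of the update rules themselves.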

Sparseness
Sparseness refers to a representational scheme in which only a few units are effectively used to represent typical data vectors [2]. The sparseness measure is shown in (4),
\[ \operatorname{sparseness}(x) = \frac{\sqrt{n} - \left( \sum_i |x_i| \right) / \sqrt{\sum_i x_i^2}}{\sqrt{n} - 1} \tag{4} \]
where x is a data vector and n is the dimensionality of x. If x contains only a single non-zero component, the value is 1; if all components are equal, the value is 0.
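As a quick check of the two boundary cases just described, the sparseness measure can be written as a short Python function. This is a sketch under the assumption that the measure is the ratio of L1 to L2 norm rescaled by the dimensionality n, which is consistent with the stated values of 1 and 0; the function name `sparseness` is chosen for the example.

```python
import numpy as np

def sparseness(x):
    """Sparseness of a vector x with n > 1 entries, at least one of them non-zero."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()           # L1 norm
    l2 = np.sqrt((x ** 2).sum())   # L2 norm
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

one_hot = sparseness([0, 0, 3, 0])   # a single non-zero component
uniform = sparseness([2, 2, 2, 2])   # all components equal
```

Here `one_hot` evaluates to 1 and `uniform` to 0, matching the two cases in the text.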

Sparse non-negative matrix factorization
Kim and Park [23] impose sparsity on the factor H so that it indicates the cluster membership. The modified formulation is shown in (5).
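One common way to read cluster membership from a sparse H is to assign each object to the cluster with the largest coefficient in its column. The sketch below assumes this argmax rule and uses a made-up coefficient matrix; neither the rule nor the values are taken from the experiments in this paper.

```python
import numpy as np

# Hypothetical coefficient matrix H (k = 2 clusters, n = 5 objects).
H = np.array([
    [0.9, 0.0, 0.7, 0.1, 0.0],
    [0.0, 0.8, 0.1, 0.6, 0.5],
])

# Each object (column) goes to the cluster (row) with the largest coefficient.
labels = np.argmax(H, axis=0)
```

The sparser the columns of H, the less ambiguous this assignment becomes, which is the motivation for constraining H.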

Nonnegative matrix factorization with sparseness constraints
Given a nonnegative matrix A of size m × n, find nonnegative matrices W and H of sizes m × k and k × n such that (2) is minimized, under the optional constraints sparseness(wi) = Sw, ∀i, and sparseness(hi) = Sh, ∀i, where wi is the i-th column of W and hi is the i-th row of H. Sw and Sh are the desired sparseness of W and H, respectively. They are set by the user, who may not know the optimal values of these parameters, so the parameters must be tuned to find them. If the initial value before tuning is far from the optimal value, the tuning takes a long time. Choosing a good initial value for the parameters is therefore essential.
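To see what such a sparseness constraint fixes, the sparseness measure can be rearranged. As a worked step, writing s for the desired sparseness of a vector x of dimension n (the symbol s is introduced here for illustration only):

```latex
s = \frac{\sqrt{n} - \|x\|_1 / \|x\|_2}{\sqrt{n} - 1}
\quad\Longrightarrow\quad
\|x\|_1 = \|x\|_2 \left( \sqrt{n} - s\,(\sqrt{n} - 1) \right)
```

For a fixed L2 norm, choosing Sw or Sh therefore amounts to fixing the L1 norm of each column of W or each row of H. For example, with n = 4 and s = 0.5, the ratio of the L1 norm to the L2 norm is 2 - 0.5 = 1.5.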

RESEARCH METHOD
In this research, we compute the sparseness directly using (4) and use it as the initial parameter of NMFSC. Hence, only a few runs are needed to find the parameter that yields the desired sparseness level explained in the previous section. We call this proposed approach Direct-NMFSC; its detailed flow is given in the subsection on data clustering using Direct-NMFSC.
NMF with a sparseness constraint is adopted to cluster the spatio-temporal fish catch data. Because the fish catch data are very sparse in both the spatial and the temporal dimensions, both matrices W and H are required to be sparse. The parameters sW and sH must be initialized in the range 0 to 1. In this paper, we compute the values of these parameters using (4).
The study was implemented in several stages, shown in Figure 2. First, we started with the tuna catch data, which we preprocessed into a more suitable form. Next, we performed the clustering, comparing three methods: K-means, SNMF, and NMFSC. In the final step, we evaluated the clustering performance. Each stage is explained in detail in the following subsections.
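The idea of deriving the initial sW and sH from the data can be sketched as follows. The text does not state how the data matrix is turned into a vector before applying the sparseness measure, so this example assumes that the whole matrix is flattened and that the same value is used for both factors; the toy matrix is made up.

```python
import numpy as np

def sparseness(x):
    """Sparseness measure applied to the flattened input (assumption)."""
    x = np.asarray(x, dtype=float).ravel()
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

# Hypothetical grid-by-month catch matrix: mostly zeros, a few catches.
A = np.zeros((6, 4))
A[0, 1] = 5.0
A[3, 2] = 2.0
A[4, 0] = 1.0

# Assumption: one pre-computed value initializes both sparseness parameters.
s_init = sparseness(A)
s_W = s_H = s_init
```

Because most entries are zero, `s_init` lies close to 1, so the tuning starts from a value already near the sparse end of the range rather than from an arbitrary grid point.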

Tuna fish catch data
The data were obtained from the tuna catch records of PT. Perikanan Nusantara (Persero) Bali, Indonesia, which cover several tuna species (Albacore, Bigeye, and Yellowfin). In this study we used only the Albacore catch data because the other species were caught in smaller numbers. The fishing area was in the Indian Ocean, within latitudes 2-16.56 S and longitudes 100.49-140 E, from 2000 to 2005, and contained 1,271 Albacore catch points.

Data preprocessing
In the data preprocessing phase, we transform the data into a form more suitable for the research [24]. Before clustering, the fish catch data points were aggregated into spatial grids. Points that fall in the same grid cell represent the same catch area. A grid interval of 0.5 degrees (55.6 km) was used in this study. The catch data were also aggregated temporally by month, giving a total of 72 months in the temporal dimension (6 years). The result of the preprocessing is an m × n matrix, where m is the number of spatial grid cells and n is the number of monthly aggregates. Taking the grid coordinates of the tuna catches over the 6 years, we found 222 grid cells with catches. The spatial coordinates of these grid cells for 2000-2005 are plotted in Figure 3.
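The aggregation into a grid-by-month matrix can be sketched as follows. The column layout of the raw records is not given in the text, so the argument names (`lat`, `lon`, `year`, `month`, `catch`) and the function name `build_catch_matrix` are assumptions, as is the choice to sum a generic catch value per cell and month; southern latitudes are written as negative numbers.

```python
import numpy as np

def build_catch_matrix(lat, lon, year, month, catch,
                       cell=0.5, first_year=2000, n_years=6):
    """Aggregate catch points into a (grid cell x month) matrix."""
    lat = np.asarray(lat, dtype=float)
    lon = np.asarray(lon, dtype=float)
    # Integer index of the 0.5-degree cell that contains each point.
    cells = np.stack([np.floor(lat / cell), np.floor(lon / cell)], axis=1).astype(int)
    # Keep only cells with at least one catch; `row` maps each point to its cell.
    uniq, row = np.unique(cells, axis=0, return_inverse=True)
    row = np.asarray(row).ravel()
    # Month index 0..71 counted from January of the first year.
    col = (np.asarray(year) - first_year) * 12 + (np.asarray(month) - 1)
    A = np.zeros((len(uniq), 12 * n_years))
    np.add.at(A, (row, col), np.asarray(catch, dtype=float))
    return A, uniq * cell  # matrix and south-west corner of each cell

# Four made-up catch records; three of them fall into the same grid cell.
A, corners = build_catch_matrix(
    lat=[-12.1, -12.3, -14.8, -12.2],
    lon=[105.2, 105.4, 110.1, 105.3],
    year=[2000, 2000, 2001, 2005],
    month=[1, 1, 3, 12],
    catch=[3, 2, 4, 1],
)
```

In this toy case the result has 2 rows (occupied cells) and 72 columns (months); with the data described above it would have 222 rows.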

Data clustering using Direct-NMFSC
In this research, we examine the clustering results for the tuna catch data using sparse nonnegative matrix factorization (SNMF) [17] and nonnegative matrix factorization with sparseness constraints (NMFSC) [13]; both extend the NMF algorithm with sparseness constraints. Essentially, NMF is a dimension (feature) reduction algorithm, like principal component analysis (PCA). However, NMF can also be used for clustering by adding a sparseness constraint to the objective function. The clustering results were compared with the K-means algorithm as a baseline method. We use K-means as the baseline because it is considered to handle clustering well [25].
We propose to compute the sparseness directly before running NMFSC, whereas in the standard version of NMFSC the sparseness value must first be tuned. The proposed approach only needs to execute the method once for each number of clusters, whereas basic NMFSC must also search for the proper sparseness value, so it will take ( 2 ). The flow of Direct-NMFSC is shown in Figure 4.

Cluster evaluation
Internal cluster validation is used in this experiment to measure the clustering results, since no actual labels are available for comparison. We used an internal validity index, the Calinski-Harabasz (CH) index. The CH index [26] evaluates cluster validity based on the average between-cluster and within-cluster sums of squares, as shown in (6),
\[ CH = \frac{SS_B / (k - 1)}{SS_W / (N - k)} \tag{6} \]
where $SS_B$ denotes the sum of squared errors between different clusters, $SS_W$ is the sum of squared differences of all objects in a cluster from their respective cluster center, k is the number of clusters, and N is the number of objects. A higher CH index indicates better cluster separation.
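The CH index can be computed directly from its definition as the ratio of the between-cluster sum of squares, averaged over k - 1, to the within-cluster sum of squares, averaged over N - k. The sketch below follows that standard definition; the function name `calinski_harabasz` and the toy points are assumptions for illustration.

```python
import numpy as np

def calinski_harabasz(X, labels):
    """CH index for points X (N x d) and integer cluster labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N = X.shape[0]
    clusters = np.unique(labels)
    k = clusters.size
    overall = X.mean(axis=0)
    ss_b = 0.0  # between-cluster sum of squares
    ss_w = 0.0  # within-cluster sum of squares
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        ss_b += members.shape[0] * np.sum((center - overall) ** 2)
        ss_w += np.sum((members - center) ** 2)
    return (ss_b / (k - 1)) / (ss_w / (N - k))

# Two tight, well-separated groups of two points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
ch_good = calinski_harabasz(X, np.array([0, 0, 1, 1]))
ch_bad = calinski_harabasz(X, np.array([0, 1, 0, 1]))
```

The grouping that matches the visible structure scores far higher than the one that cuts across it, which is how the index is used to compare clusterings.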

RESULT AND DISCUSSION
The results of the proposed method were compared with clustering by K-means, NMF, SNMF, and NMFSC. The experiments were implemented in MATLAB. K-means clustering used the "kmeans" function and standard NMF used the "nnmf" function in MATLAB, while SNMF and NMFSC used the code implemented by Hoyer [13] with a slight modification to the convergence criterion: when the objective function in the i-th iteration is greater than in the previous iteration, the clustering process stops.
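The modified stopping rule can be sketched generically in Python. The helper `run_until_objective_increases` and its arguments are hypothetical, and the choice to return the last state before the increase is an assumption, since the text only says that the process stops.

```python
def run_until_objective_increases(step, objective, state, max_iter=1000):
    """Apply `step` repeatedly; stop once the objective exceeds its previous value."""
    previous = objective(state)
    for iteration in range(1, max_iter + 1):
        candidate = step(state)
        current = objective(candidate)
        if current > previous:
            # The objective got worse: keep the last accepted state and stop.
            return state, iteration - 1
        state, previous = candidate, current
    return state, max_iter

# Toy problem: walk down from 10 in steps of 2 toward the minimum of (x - 3.5)^2.
final, n_updates = run_until_objective_increases(
    step=lambda x: x - 2,
    objective=lambda x: (x - 3.5) ** 2,
    state=10,
)
```

In this toy run the updates 8, 6, and 4 are accepted, and the step to 2 is rejected because the objective rises from 0.25 to 2.25.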

The clustering experiments were run on the fish catch data with the number of clusters ranging from 2 to 72 (the maximum number of months in the temporal dimension). The SNMF and NMFSC experiments were conducted by tuning the sparseness coefficient from 0 to 1 in steps of 0.2. We also wanted to know whether computing (4) as an initial parameter would give good sparseness results. The experiment scenarios for each method are described in Table 1. For each number of clusters, the cluster validity was calculated using the CH index. The validity index for each method is shown in Table 2.
Direct-NMFSC gives the highest CH index, 531.98, with a precomputed sparseness value of 0.9584. Figure 5 compares the cluster validity of Direct-NMFSC with the other 4 methods, indicated by the black line and points. In Figure 5(a) and Figure 5(b), Direct-NMFSC has a better CH index than K-means and NMF, while in Figure 5(d) it is similar to NMFSC. If the optimal number of clusters is chosen by the CH index, the dominant choice is 2. Figure 6 plots the clustering results of Direct-NMFSC with 2 clusters.
We also analyzed the number of iterations for the 3 algorithms with sparseness constraints. First, we compared the number of iterations needed to obtain the best result with 2 clusters; the comparison is presented in Table 3. Direct-NMFSC reaches the best validity index within 80 iterations, fewer than NMFSC, which takes 385 iterations. SNMF needs 35 iterations, but its validity index is not as good as that of Direct-NMFSC. Second, the average and standard deviation for each scenario in Table 4 show that Direct-NMFSC needs the fewest iterations. Besides the average number of iterations of each method, we also inspected the number of iterations of the three methods with sparseness constraints, as shown in Figure 6.
After obtaining the results of the experimental scenarios, we applied Direct-NMFSC to the tuna fishing data. The clustering displayed in Figure 7 shows that two clusters are formed. Cluster 1 is represented by red asterisks and cluster 2 by blue asterisks. The red cluster spans 12°-15° S and 105°-118° E between October and February, while the blue cluster spans 8°-17° S and 100°-140° E between January and May. From the cluster analysis we find that the red cluster is the potential fishing zone. The two clusters were evaluated to measure how well the data are grouped by Direct-NMFSC. The Silhouette coefficient [27]-[29] used in the cluster validation is 0.879, which shows that the points have good cohesion within each cluster and clear separation between clusters.
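For reference, the mean Silhouette coefficient compares, for each point, the average distance a to the other members of its own cluster with the smallest average distance b to another cluster, and averages (b - a) / max(a, b) over all points. The sketch below implements this standard definition with Euclidean distances; the function name `mean_silhouette`, the score of 0 for singleton clusters, and the toy points are assumptions, and the example does not reproduce the value reported above.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean Silhouette coefficient for points X (N x d) with at least 2 clusters."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            scores.append(0.0)  # convention for singleton clusters
            continue
        # Cohesion: the distance of a point to itself is zero, so divide by n_same - 1.
        a = D[i, same].sum() / (n_same - 1)
        # Separation: nearest other cluster by average distance.
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
score = mean_silhouette(X, np.array([0, 0, 1, 1]))
```

Values close to 1 indicate tight, well-separated clusters, which is the reading applied to the coefficient in the text.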

CONCLUSION
Because the fish catch data are sparse, a sparseness constraint should be added to the NMF-based clustering approach and a good initial value should be assigned. From the experimental results, we conclude that Direct-NMFSC, a modification of standard NMF, yields a good validity index compared with K-means, NMF, SNMF, and NMFSC. The execution time of the Direct-NMFSC algorithm can be reduced by choosing an appropriate initial parameter. One proposed solution is to use (4) to compute the initial value. The results of this study show that Direct-NMFSC outperforms the other 3 algorithms (NMF, SNMF, and NMFSC): it needs on average 5.376 times fewer iterations than NMFSC (212.042 iterations) and yields the best cluster validity index, 531.97. Information on the PFZ can be obtained with a data-driven approach using an NMF clustering algorithm. The results show that the red cluster represents the potential fishing zone and carries its spatial and temporal information (between 12°-15° S and 105°-118° E, between October and February).