Visualization of hyperspectral images on parallel and distributed platform: Apache Spark

ABSTRACT


INTRODUCTION
Currently, digital display devices produce a color image for the human eye by combining three primary colors. A classic red, green and blue (RGB) color image is therefore a combination of three layers (bands): red, green and blue. Likewise, a hue, saturation and lightness (HSL) color image is a combination of three layers: hue, saturation and lightness [1]. In contrast, hyperspectral images are composed of hundreds of layers (bands). A hyperspectral image can be described as a three-dimensional data cube consisting of two spatial dimensions and one spectral dimension. In this representation, each pixel contains a spectrum of wavelengths within the visible-near infrared range, spanning from 400 to 1,400 nanometers.
Hyperspectral imaging is frequently used in remote sensing, environmental monitoring [2], [3], polarimetric imaging, land cover classification [4], [5] and multimodal medical imaging. In astronomy, for example, hyperspectral imagery is used to archive soil and space observations. In medical imaging, it is used for the detection of diseases such as cancer [6].
Hyperspectral imaging produces high-dimensional data in which each pixel of the image is represented by a spectrum of measurements across many different wavelengths. However, this high dimensionality can make the data challenging to analyze and interpret. The question is then how to visualize a hyperspectral cube so as to give the user, who is usually not a specialist, a synthetic view of the data contained in the image with the minimum possible loss, and to facilitate the interpretation of the image. Among the first solutions proposed was to visualize the whole cube as a video sequence in which each layer of the cube is shown as one image. However, when working in the plane and with many hyperspectral images of large spectral dimension, this solution remains difficult to apply.
ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 7115-7124
To visualize, in color and in the plane, a hyperspectral image whose number of spectral bands exceeds three, it is often necessary to reduce the dimensionality of the image and obtain, from the original image, a composite image that consists of three bands. Dimension reduction techniques such as principal component analysis (PCA) or t-distributed stochastic neighbor embedding (t-SNE) can be employed to transform the high-dimensional hyperspectral data into a lower-dimensional space while preserving as much of the information as possible. With fewer dimensions, the data become easier to visualize and interpret. By visualizing the reduced-dimensional data, researchers and analysts can gain insights into the underlying patterns and relationships in the data, which can help with tasks such as identifying and classifying different materials or objects in the scene. This, in turn, has applications in fields such as remote sensing, agriculture, and environmental monitoring. Several families of methods for visualizing hyperspectral images exist: methods based on spectral band selection [7]-[9], methods based on weighting [5], methods based on optimization [6] and projection-based methods [10].
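As an illustration of the final display step, the sketch below rescales three reduced bands so that they can be shown as an RGB image. It is a minimal example under stated assumptions: the function name `stretch_to_rgb`, the min-max stretch, and the synthetic input are illustrative choices and are not taken from the method described in this paper.

```python
import numpy as np

def stretch_to_rgb(components, n_rows, n_cols):
    """Rescale an (n_rows * n_cols, 3) array to an (n_rows, n_cols, 3) image in [0, 1].

    Assumption: a per-channel min-max stretch; other stretches are possible.
    """
    components = np.asarray(components, dtype=float)
    lo = components.min(axis=0)
    hi = components.max(axis=0)
    # Avoid division by zero when a channel is constant
    span = np.where(hi > lo, hi - lo, 1.0)
    scaled = (components - lo) / span
    return scaled.reshape(n_rows, n_cols, 3)

# Synthetic stand-in for three reduced bands of a 4 x 5 pixel image
rng = np.random.default_rng(0)
reduced = rng.normal(size=(4 * 5, 3))
rgb = stretch_to_rgb(reduced, 4, 5)
```

The resulting array can be passed to any image display routine that accepts floating-point RGB values in the range [0, 1].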
To bypass the computational problem posed by the processing of large hyperspectral cubes [11], [12], we use an open source framework named Apache Spark [13], which stores the distributed data in random-access memory (RAM) and processes it in parallel. This choice gave us a considerable gain in the time needed to visualize a hyperspectral image.
The rest of this paper is structured as follows. Section 2 presents the related work on hyperspectral image visualization methods. Section 3 describes our parallel distributed visualization approach based on the PCA projection method. Section 4 evaluates our approach on several freely available hyperspectral images. The paper ends with a conclusion and perspectives.

RELATED WORKS OF HYPERSPECTRAL IMAGE VISUALIZATION METHODS
In the field of hyperspectral imaging, several methods are used to visualize a hyperspectral image. The literature review revealed four main families of methods, based on band selection, weighting, optimization, and transformation.
The first method is based on band selection [7]-[9]. To visualize a hyperspectral image in an RGB representation system, three spectral bands are selected from the original hyperspectral image composed of hundreds of bands, and each band is assigned to a color: red, green or blue. This type of visualization is used in the AVIRIS browser [14]. Visualization with this method is fast, but it uses only the data in the three selected bands and ignores the data from the other bands, so a large amount of the information in the image is lost.
The second method is based on weighting [15]. It produces an image resulting from a linear combination of the input image bands. There are two types: methods based on stretched color matching functions (CMFs) and methods based on bilateral filtering. The advantage of this method is that it uses all the bands of the image, but the difficulty lies in the choice of weights: the same weight is attributed to all the pixels of the image, ignoring the variety of pixels.
The third method is based on optimization [16]. In this method, functions are applied to the image according to an optimization criterion. Examples include methods based on Markov random fields and methods based on the maximization of multiple goals. The main challenge is finding the right function to apply to the image.
The last method is based on transformation. With this method, a hyperspectral image is visualized by projecting the original image onto a space of smaller dimension (three, for example). Over the past few years, several dimensionality reduction techniques have emerged to map hyperspectral data to a lower-dimensional space. Important examples include Hessian eigenmap embedding, locally linear embedding (LLE), isometric feature mapping (ISOMAP), Laplacian eigenmap embedding, diffusion maps, conformal maps, independent component analysis (ICA) [17] and PCA [18], [19].
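The first two families described above can be contrasted in a few lines. The sketch below treats band selection as a special case of weighting in which each color channel copies exactly one band; the function names, the band indices and the equal-weight example are hypothetical and serve only to illustrate the idea, not any specific method from the cited works.

```python
import numpy as np

def linear_composite(cube, weights):
    """Combine the N bands of an (L, C, N) cube into 3 channels with an (N, 3) weight matrix."""
    n_rows, n_cols, n_bands = cube.shape
    pixels = cube.reshape(-1, n_bands).astype(float)  # one row per pixel
    return (pixels @ weights).reshape(n_rows, n_cols, 3)

def band_selection_weights(n_bands, red, green, blue):
    """One-hot weights: each output channel copies exactly one input band."""
    weights = np.zeros((n_bands, 3))
    weights[red, 0] = 1.0
    weights[green, 1] = 1.0
    weights[blue, 2] = 1.0
    return weights

rng = np.random.default_rng(1)
cube = rng.random((4, 6, 10))  # synthetic cube: 4 x 6 pixels, 10 bands

# Band selection: only bands 7, 4 and 1 contribute, the other 7 are ignored
selected = linear_composite(cube, band_selection_weights(10, red=7, green=4, blue=1))

# Weighting: every band contributes, here with the same weight for all pixels
averaged = linear_composite(cube, np.full((10, 3), 1.0 / 10))
```

Writing both methods as a product with an (N, 3) weight matrix makes the trade-off visible: band selection zeroes out all but three rows of the matrix, while weighting keeps every band but applies one fixed set of weights to all pixels.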
In this paper, we use the PCA algorithm, which belongs to the last family of methods, to perform the visualization. PCA is among the dimension reduction algorithms that can be implemented effectively, and it is used successfully in commercial remote sensing applications [20]. However, for a large hyperspectral image, PCA requires considerable computation time [11], [12]. To solve this problem, we use distributed and parallel computing.
At present, two widely used libraries offer a parallel distributed implementation of the PCA algorithm: MLlib on Spark [21], [22] and Mahout, based on MapReduce [23]. Elgamal [24] demonstrated that these two libraries do not allow a flawless analysis of a large mass of data and introduced a novel implementation of PCA called sPCA. This algorithm exhibits superior scalability and accuracy compared to its competitors. Wu et al. [25] proposed a new distributed parallel implementation of the PCA algorithm. The implementation uses the Spark platform, and the results obtained are compared with a serial implementation in MATLAB and a parallel implementation on Hadoop. The comparison demonstrates the effectiveness of the proposed implementation in terms of both precision and computation time.

THE PROPOSED VISUALIZATION APPROACH
Visualization is often employed to comprehend the information concealed within the hyperspectral image cube or to extract a relevant portion of the image. However, due to the limitations of human perception, we can only visualize a limited number of hyperspectral bands (typically up to 3). Before visualizing our hyperspectral image, it is therefore necessary to reduce the number of spectral bands to 3 without compromising the quality of the information. In the subsequent steps, we employ PCA, a technique widely used for dimensionality reduction, image processing, data visualization, and the discovery of underlying patterns within data.

Classic PCA algorithm
PCA [26] is a dimensionality reduction technique used to reduce the dimensions of a matrix containing quantitative data. It enables the extraction of the dominant profiles from the matrix. To apply the PCA algorithm (refer to Algorithm 1) to the hyperspectral image, we consider the hyperspectral image M as a matrix of size $(m = L \times C, N)$, where C is the number of columns, L the number of rows, and N the number of bands in the image. Note that $m \gg N$: the number of pixels is significantly higher than the number of spectral bands. Every row of the matrix M corresponds to a pixel vector. For example, the first pixel is represented by the vector $[M_{11}, M_{12}, \ldots, M_{1N}]$, where $M_{1j}$ is the value of pixel 1 in spectral band j. Each column of the matrix M holds the values of all pixels of the image captured in a specific spectral band. For example, $M_1 = [M_{11}, M_{21}, \ldots, M_{m1}]$ represents the image data captured in band 1. In formula (1), $\overline{M_j}$ denotes the average of column j and $\sigma_j$ denotes the standard deviation of column j. In formula (2), $MRC^T \cdot MRC$ denotes the matrix product between the transpose of the matrix MRC and the matrix MRC.
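The serial procedure described above can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' code: the function name `classical_pca`, the handling of constant bands, the 1/m normalization of the correlation matrix, and the synthetic cube are assumptions made for the example.

```python
import numpy as np

def classical_pca(M, k=3):
    """Serial PCA on an (m, N) pixel-by-band matrix; returns the (m, k) projection."""
    M = np.asarray(M, dtype=float)
    m = M.shape[0]

    # Reduced centered matrix: subtract the column mean, divide by the column std
    mean = M.mean(axis=0)
    std = M.std(axis=0)
    std[std == 0] = 1.0  # assumption: constant bands are left unscaled
    MRC = (M - mean) / std

    # Correlation matrix of size (N, N); the 1/m factor is an assumption
    MC = MRC.T @ MRC / m

    # Eigen-decomposition of the symmetric matrix, sorted by descending eigenvalue
    eigenvalues, eigenvectors = np.linalg.eigh(MC)
    order = np.argsort(eigenvalues)[::-1][:k]
    V = eigenvectors[:, order]

    # Projection of M onto the selected eigenvectors, as stated in the text
    U = M @ V
    return U, eigenvalues[order]

# Synthetic cube with L = 6 rows, C = 5 columns and N = 8 bands
rng = np.random.default_rng(2)
cube = rng.random((6, 5, 8))
M = cube.reshape(-1, 8)  # each row is one pixel vector
U, top = classical_pca(M)
```

Reshaping the cube with `reshape(-1, N)` yields exactly the layout described in the text: one row per pixel and one column per spectral band.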

Proposed distributed and parallel PCA algorithm
Due to the large size of hyperspectral images, the traditional PCA algorithm requires computationally intensive processing. In this section, we introduce a parallel distributed implementation of the algorithm on the Spark platform. Given that a hyperspectral image captures the same scene across multiple spectral bands, we can decompose the hyperspectral image into individual images, each representing a specific spectrum (as depicted in Figure 1). First, we transform the hyperspectral cube into a one-dimensional vector V of size N. Each element $V_t$ of V contains an image of size L × C for a given band (Figure 1). Each image $V_t$ is then stored in RAM as a resilient distributed dataset (RDD). To obtain a parallel distributed implementation of PCA, we leverage the map-reduce paradigm of Spark. The proposed algorithm operates in the following manner:
Step 1: Calculate the reduced centered image $MRC_t$ of each image $V_t$. The result is a vector MRC of N images.
Step 2: Calculate the correlation matrix of MRC, of size (N, N), denoted MCorr. As stated in step 1, MRC is an image vector of size N, and every image $MRC_t$ corresponds to a reduced and centered matrix. We use Spark's distributed parallel computation framework, MapReduce, to compute the correlation matrix of size (N, N) as a matrix product between $MRC^T$ and MRC: $MCorr_{t,k} = \frac{1}{L \times C} \sum_{i=1}^{L} \sum_{j=1}^{C} MRC_t(i,j) \cdot MRC_k(i,j)$ for each t = 1 to N and for each k = 1 to N (6).
Step 3: Calculate the eigenvectors and eigenvalues of the MCorr matrix, where V denotes the matrix of eigenvectors.
Step 4: Arrange the eigenvectors in descending order of their corresponding eigenvalues and select the first three columns of V (3 < N).
Step 5: Project the matrix M onto the selected eigenvectors: $U = M \cdot V$.
Step 6: Use the newly obtained matrix U, of size (m, 3), to visualize the hyperspectral image.
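The band-wise organization of these steps can be imitated locally without a cluster. The sketch below is a single-machine stand-in written with NumPy and Python's built-in `map`; it does not call any Spark API. The names `standardize_band` and `banded_pca`, the usual (value minus mean) divided by standard deviation definition of a reduced centered image, and the synthetic cube are assumptions made for illustration. On a cluster, the per-band function would be applied to distributed band images instead of a local list.

```python
import numpy as np

def standardize_band(band):
    """Map step (step 1): reduce and center one (L, C) band image."""
    std = band.std()
    return (band - band.mean()) / (std if std > 0 else 1.0)

def banded_pca(cube, n_components=3):
    """Band-wise PCA on an (L, C, N) cube; returns the (L*C, n_components) projection."""
    n_rows, n_cols, n_bands = cube.shape
    # Vector V of N band images, each of size L x C
    bands = [cube[:, :, t].astype(float) for t in range(n_bands)]

    # Step 1: one independent task per band
    mrc = list(map(standardize_band, bands))

    # Step 2: pixel-by-pixel product of two band images, then the average
    mcorr = np.array([[np.mean(mrc[t] * mrc[k]) for k in range(n_bands)]
                      for t in range(n_bands)])

    # Steps 3 and 4: eigen-decomposition, keep the leading eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(mcorr)
    order = np.argsort(eigenvalues)[::-1][:n_components]
    selected = eigenvectors[:, order]

    # Step 5: U = M . V accumulated band by band (a sum that can be reduced)
    U = np.zeros((n_rows * n_cols, n_components))
    for band, row in zip(bands, selected):
        U += band.reshape(-1, 1) * row
    return U, eigenvalues[order], selected

rng = np.random.default_rng(4)
cube = rng.random((6, 5, 8))  # synthetic cube: 6 x 5 pixels, 8 bands
U, top3, selected = banded_pca(cube)
```

Each entry of the correlation matrix depends on only two band images, and the projection is a sum of per-band contributions, which is what makes the computation a natural fit for a map-reduce style of execution.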

EXPERIMENTS AND COMPUTATIONS
To test the proposed algorithm, we used the freely available Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) Moffett Field image, which has 224 spectral bands in the 400 nanometers to 2,500 nanometers range and was acquired on August 20, 1992 [14]. From this hyperspectral image, we took samples of different sizes (see Table 1), and on each resulting image we ran the proposed distributed parallel algorithm and a serial implementation of classical PCA from the Python library scikit-learn. We collected the three most significant eigenvalues, shown in Table 2, and the execution time of each algorithm, shown in Figure 2.
The classical PCA of the scikit-learn library was tested on one computer with the following configuration: CPU: Intel® Core™ i7-2820QM CPU @ 2.30 GHz × 8, RAM: 8 GB, OS: Ubuntu 16.04 LTS. The proposed distributed parallel algorithm was tested on the Databricks cloud [26] with the configuration shown in Table 3. Both algorithms are programmed in Python. The runtime comparison shows that the scikit-learn PCA is faster for small images, but when the image has a larger number of bands (more than 10 spectral bands), the proposed PCA is faster; see Figure 2.
Figure 3 illustrates the visualization of a hyperspectral image dataset. Figure 3(a) displays the original representation of the image, without any PCA applied. In contrast, Figure 3(b) shows the image after the application of either classical PCA or the newly proposed distributed PCA method.
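A consistency check of the kind reported for the eigenvalues can be sketched as follows. The example compares the three largest eigenvalues obtained from the correlation matrix with those obtained from a singular value decomposition of the standardized data. It uses synthetic matrices only, not the AVIRIS samples, and the function names and matrix sizes are assumptions made for illustration.

```python
import numpy as np

def standardize(M):
    """Reduce and center every column (band) of an (m, N) matrix."""
    M = np.asarray(M, dtype=float)
    std = M.std(axis=0)
    std[std == 0] = 1.0  # assumption: constant bands are left unscaled
    return (M - M.mean(axis=0)) / std

def top_eigenvalues_eig(M, k=3):
    """Largest k eigenvalues of the (N, N) correlation matrix."""
    Z = standardize(M)
    corr = Z.T @ Z / Z.shape[0]
    return np.sort(np.linalg.eigvalsh(corr))[::-1][:k]

def top_eigenvalues_svd(M, k=3):
    """Same quantities from the singular values of the standardized data."""
    Z = standardize(M)
    singular_values = np.linalg.svd(Z, compute_uv=False)  # returned in descending order
    return (singular_values ** 2 / Z.shape[0])[:k]

# Synthetic samples with a growing number of bands
rng = np.random.default_rng(3)
results = []
for n_bands in (5, 20, 50):
    sample = rng.random((400, n_bands))
    results.append((top_eigenvalues_eig(sample), top_eigenvalues_svd(sample)))
```

If two implementations are correct, their leading eigenvalues should agree up to floating-point error, whichever factorization each one uses internally.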

CONCLUSION
In this work, a method of visualizing a hyperspectral image has been proposed based on the reduction of the dimensionality of the image in a parallel distributed environment.The algorithm has been developed using Python 3, and evaluated on hyperspectral images utilizing the Spark platform.The results obtained align with those of traditional PCA, and the visualization of the images post-application of our reduction confirms the validity of our algorithm.By comparing the execution time of the two algorithms: sklearn PCA and proposed PCA, we discovered that the proposed PCA algorithm displays faster performance when processing large images.This observation implies that the proposed PCA algorithm may be more efficient and effective in handling larger data sets compared to the classic algorithm.

Algorithm 1. Classical PCA algorithm
Algorithm Classical_PCA(M)
Input: matrix M of dimension (m, N)
Output: matrix U of dimension (m, 3)
- Calculate the reduced centered matrix of M, denoted MRC: $MRC_{ij} = \frac{M_{ij} - \overline{M_j}}{\sigma_j}$ for each i = 1...m and for each j = 1...N (1)
- Calculate the correlation matrix of size (N, N), denoted MC: $MC = \frac{1}{m} MRC^T \cdot MRC$ (2)
- Calculate the eigenvalues and eigenvectors of the MC matrix, where V denotes the matrix of eigenvectors
- Sort the eigenvectors in descending order of the eigenvalues and take the first k columns of V (k < N)
- Project the matrix M onto V: $U = M \cdot V$
- Use the new matrix U of size (m, k) for the visualization of the hyperspectral image
- return U

Figure 1. Descriptive diagram of the proposed PCA algorithm

Figure 2. The runtime comparison between sklearn and the proposed PCA

Figure 3. Visualization of a hyperspectral image (dataset 12): (a) the image displayed without PCA and (b) the image after applying either classical PCA or the proposed PCA
Algorithm 3. Calculating the reduced centered matrix with Spark: calculate the standard deviation of image $V_t$, denoted $\sigma_t$, then, for each line of the image and for each value in that line, calculate the corresponding reduced centered value using formula (4). In formula (4), $\sigma_t$ denotes the standard deviation of image $V_t$.
To determine the value of each $MCorr_{t,k}$ in formula (6), the image $MRC_t$ is multiplied by the image $MRC_k$ pixel by pixel. Then we calculate the average of the result (Algorithm 4).
Algorithm 4. Calculation of the correlation matrix using Spark: $MCorr_{t,k} = \frac{1}{L \times C} \sum_{i=1}^{L} \sum_{j=1}^{C} MRC_t(i,j) \cdot MRC_k(i,j)$ for each t = 1 to N and for each k = 1 to N (6)

Table 1. Datasets
Visualization of hyperspectral images on parallel and distributed platform: Apache Spark (Abdelali Zbakh)

Table 2. Example of the top three eigenvalues obtained from PCA

Table 3. Configuration parameters of the Spark cluster in the Databricks cloud