A framework for cloud cover prediction using machine learning with data imputation

ABSTRACT


INTRODUCTION
The climate of a region directly affects human life in many ways. Predicting climate parameters such as wind speed, dew point temperature, and cloud cover helps in determining and planning crop yield. The energy and aviation sectors also benefit from such predictions. The cloud cover of a place helps in predicting rainfall and sunshine duration, which leads to better planning of solar energy initiatives. Cloud cover also affects visibility, which is a very significant factor for airline operations. The prediction of cloud cover is influenced by various factors such as rainfall, wind speed and direction, and vapor pressure. Cloud cover is measured in oktas.
In Maharashtra, the climate varies significantly from one region to another. The different regions of Maharashtra are Vidarbha, Marathwada, Konkan, and Madhya Maharashtra. The state experiences heavy cloud cover in some parts during certain months and low to negligible cloud cover in other regions during other months. Since agriculture is the main occupation in Maharashtra, it is important to predict cloud cover so that rainfall can be estimated and proper water management can be done. Some regions experience severe drought conditions due to lack of rainfall, whereas others experience floods due to excessive rainfall [1]. Over the last few years, machine learning techniques have become widely popular due to their application to a large number of environmental problems [2]. They are used for drought and rainfall prediction [3], [4]. Water level prediction can also be done with good accuracy using deep learning algorithms [5]. The amount of rainfall a particular region receives can be estimated by predicting the cloud cover of that region. Cloud cover determines how much the sun is obstructed by clouds, which helps in solar plant operation. Cloud motion vectors are used for forecasting solar radiation [6]. Tracking of clouds is done using binary cross-correlation, along with the maximum cross-correlation technique. Quality control is done by measuring the vectors for incorrectly detected motions. Assessment of cloud cover is also done in numerical weather prediction by researchers [7]. A new facility was introduced that conceals the sunburn effect present in the background; detection of thin clouds is better when this method is used with an artificial neural network (ANN) [8]. To determine the cloud cover over a particular region, satellite images are also used. A dataset of satellite images was created for the classification of cloud patches [9]. A convolutional neural network (CNN) was used along with data augmentation and regularization. Satellite images were also used for predicting the movement of clouds using neural networks [10]. Deep-learning techniques are further used to classify cloudy or clear skies from images [11]. Various deep-learning techniques are applied to these satellite images for forecasting [12]-[15].
This paper consists of five sections. Section 2 presents the proposed framework, describing the proposed data imputation method and the proposed model. Section 3 is the method section, which includes the dataset details and the data handling techniques. Section 4 presents the results and discussion, consisting of the tables and their descriptions. Section 5 concludes with a summary of the overall work done in this research.

PROPOSED FRAMEWORK
The proposed framework is based on machine learning, which emphasizes learning by identifying patterns in the data. The framework consists of data imputation and model building. Figure 1 shows the proposed framework. The proposed data imputation method eliminates the need to delete rows containing missing values, thereby preserving the size of the dataset. It uses iterative and k-nearest neighbors (KNN) imputation. The proposed model is deep learning based and uses the principle of long short-term memory. It consists of small memory units that can handle data in time series format and pass information bi-directionally. The idea behind this proposed framework is to utilize all the available data and build a model that gives low prediction error.

METHOD
To obtain the prediction, the dataset in this research has been acquired from the India Meteorological Department, Pune, Maharashtra, India [16]. The dataset covers 8 stations, namely: Akola, Nagpur, Chikalthana, Parbhani, Colaba, Ratnagiri, Kolhapur, and Nashik. The features included are mean sea level pressure (MMSLP), mean dew point temperature (MDPT), mean relative humidity (MRH), mean vapor pressure (MVP), mean total cloud (MTC), total mean rainfall (TMRF), mean wind speed (MWSP), and the directions of wind: north (N), south (S), north east (NE), south east (SE), east (E), west (W), south west (SW), and north west (NW). Analysis of the raw data shows that it contains some missing values. These are handled using the data imputation techniques described in the next section. The result is the imputed data, which is further scaled. Scaling is necessary because different features are measured in different units: some are measured in percentage, some in millimeters, and some in kilometers per hour. The data then needs to be split for training and testing. The machine learning model takes this time series data as input [17], [18]. Figure 2 represents the steps followed for handling the data.
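Since the features carry different units (%, mm, km/h), the scaling step can be sketched as a simple min-max rescaling of each column to [0, 1]; the column names below are illustrative assumptions, not the paper's actual data layout.

```python
# Minimal sketch of min-max scaling for features measured in different
# units; the columns (MRH in %, TMRF in mm, MWSP in km/h) are assumed.
import numpy as np

def min_max_scale(X):
    """Scale each column of X to the [0, 1] range."""
    x_min = X.min(axis=0)
    x_rng = X.max(axis=0) - x_min
    x_rng[x_rng == 0] = 1.0          # guard against constant columns
    return (X - x_min) / x_rng

X = np.array([[60.0, 12.0, 8.0],
              [85.0,  0.0, 3.0],
              [70.0,  4.0, 5.0]])
Xs = min_max_scale(X)
print(Xs.min(), Xs.max())  # 0.0 1.0
```

Any scaler with the same column-wise behavior (e.g. a standard-score scaler) would serve the same purpose; the essential point is that each feature is brought to a comparable range before training.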

Data imputation
The missing values are predicted using data imputation techniques [19]-[25]. To predict the mean total cloud cover of a region, it is important to understand how the features influence the target variable. In the proposed data imputation method, this influence is studied through the correlation between the features [26]. To determine the correlation, a heatmap is used [27]. A heatmap is a visual representation of the relationships between the different features, using a color-coding scheme to show the correlation. The most correlated features are combined into one group and the less correlated features are put in another group, as mentioned in Figure 1. In the proposed method, the iterative imputer is applied to the features of the first group, whereas the second group of features uses the KNN imputer [28]. Table 1 represents these features for the different stations. It is observed that for all 8 stations, some of the most correlated and some of the least correlated features are common.
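The grouping step above can be sketched as splitting features by the absolute value of their correlation with the target column; the threshold, feature names, and synthetic data below are illustrative assumptions, not values from the paper.

```python
# Sketch of correlation-based feature grouping: features strongly
# correlated with the target go to the iterative imputer, the rest
# to the KNN imputer. Threshold 0.3 is an assumed example value.
import numpy as np

def split_by_correlation(X, names, target_col, threshold=0.3):
    """Split feature names into a highly correlated group and a
    weakly correlated group, by |corr| with the target column."""
    corr = np.corrcoef(X, rowvar=False)[target_col]
    high = [n for n, c in zip(names, corr) if abs(c) >= threshold]
    low = [n for n, c in zip(names, corr) if abs(c) < threshold]
    return high, low

rng = np.random.default_rng(0)
t = rng.normal(size=200)                                  # target proxy
X = np.column_stack([t,
                     t + 0.1 * rng.normal(size=200),      # strongly related
                     rng.normal(size=200)])               # unrelated
high, low = split_by_correlation(X, ["MTC", "MRH", "S"], target_col=0)
print(high, low)  # ['MTC', 'MRH'] ['S']
```

In scikit-learn terms, the first group would then be passed to `IterativeImputer` and the second to `KNNImputer`, matching the two-imputer scheme described above.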

Model building
In the proposed model, the data first needs to be arranged in time series format. The conversion of the data into a time series is an important aspect here. Time step, feature, and batch size are used as input, and the output is based on the values of the features at previous time steps along with the current state values. Figure 3 represents the proposed model. The basis of the proposed model is the bi-directional long short-term memory (LSTM) of deep learning [29]-[32]. The benefit of using this model is that the information stored in the cell is used for future processing. It is a bidirectional model where the processing is sequential and consists of two LSTMs: one takes the input in the forward direction and the other takes it in the backward direction. The model is trained using train-test splits of 80%-20% and 90%-10%.
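The conversion into (samples, time steps, features) tensors described above can be sketched as a sliding-window slicing of the feature table; the window length and the choice of target column are illustrative assumptions.

```python
# Sketch of converting a (T, n_features) table into the
# (samples, timesteps, features) tensor an LSTM expects.
import numpy as np

def make_windows(data, timesteps):
    """Slice the table into overlapping windows. Each sample holds
    `timesteps` consecutive rows; the target is the value of the
    first feature (assumed: mean total cloud) at the next step."""
    X, y = [], []
    for i in range(len(data) - timesteps):
        X.append(data[i:i + timesteps])
        y.append(data[i + timesteps, 0])
    return np.array(X), np.array(y)

data = np.arange(20.0).reshape(10, 2)      # 10 steps, 2 features
X, y = make_windows(data, timesteps=3)
print(X.shape, y.shape)  # (7, 3, 2) (7,)
```

The resulting `X` can be fed directly to a recurrent model in batches, with `y` as the supervised target.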
The number of layers used in this proposed model is three, and to validate the results, the mean square error (MSE), root mean square error (RMSE), and mean absolute error (MAE) are used [33], [34]. The proposed model is a sequential model created by stacking the layers one by one. There are 3 layers: two bi-directional layers and a dense layer. The first bi-directional layer allows the simultaneous processing of input sequences in both directions. The output of this layer goes to the second bi-directional layer, which again processes it in both directions. The dense layer is the last layer, which gives a single output for each input.
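The three-layer stacking described above can be sketched with the Keras API; the unit counts, window length, and feature count are illustrative assumptions, not the paper's actual hyperparameters.

```python
# Hedged sketch of a stacked model: two bi-directional LSTM layers
# followed by a dense layer with a single output. Layer sizes assumed.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense

def build_model(timesteps, n_features, units=32):
    model = Sequential([
        Input(shape=(timesteps, n_features)),
        # first bi-directional layer returns the full sequence so the
        # second bi-directional layer can process it in both directions
        Bidirectional(LSTM(units, return_sequences=True)),
        Bidirectional(LSTM(units)),
        # dense layer gives a single output per input sample
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

model = build_model(timesteps=7, n_features=15)
pred = model.predict(np.zeros((2, 7, 15)), verbose=0)
print(pred.shape)  # (2, 1)
```

Because `return_sequences=True` is set only on the first recurrent layer, the second layer collapses the sequence into a fixed-size vector before the dense output, matching the single-output design described above.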

ISSN: 2088-8708 
Figure 3. Proposed model

RESULTS AND DISCUSSION
The model is trained for various numbers of epochs: 100, 200, 300, 500, 800, and 1,000, for two different approaches. In the first approach, the model is trained on 80% of the data and tested on the remaining 20%. In the second approach, the model is trained on 90% of the data and tested on 10%. Model performance is evaluated using MSE, RMSE, and MAE.

Model performance evaluation using MSE
For the evaluation of the models, the mean squared error is used. Equation (1) represents the average of the squared differences between the actual values (y_i) and the predicted values (ŷ_i):

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²   (1)
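The three metrics used in this paper can be worked through on a toy prediction; the values below are chosen only to make the arithmetic easy to follow.

```python
# Worked example of MSE, RMSE, and MAE on toy actual/predicted values.
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def rmse(y, y_hat):
    return np.sqrt(mse(y, y_hat))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

y = np.array([2.0, 4.0, 6.0])      # actual cloud cover (oktas)
y_hat = np.array([2.0, 5.0, 4.0])  # predicted values
# errors are [0, -1, 2]: MSE = 5/3, RMSE = sqrt(5/3), MAE = 1
print(mse(y, y_hat), rmse(y, y_hat), mae(y, y_hat))
```

RMSE is simply the square root of MSE, so it penalizes large errors in the same way while staying in the original unit (oktas), whereas MAE weights all errors linearly.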

Model performance evaluation using RMSE
The root mean squared error, the square root of the MSE, is represented by (2):

RMSE = √((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²)   (2)

The RMSE values obtained by changing the number of epochs are shown in Table 5, where the model is trained on 80% of the data and tested on the remaining 20%.
The Ratnagiri and Nagpur stations show the lowest RMSE for 100 epochs. For 500 epochs, the Akola, Chikalthana, and Kolhapur stations show the lowest RMSE. The Parbhani station shows the lowest RMSE for 200 epochs, the Nashik station for 800 epochs, and the Colaba station for 1,000 epochs. Table 6 depicts the RMSE values for 90% training data and 10% testing data. RMSE values are lowest at 100 epochs for Ratnagiri, Akola, and Nagpur. For the Chikalthana station 500 epochs and for the Colaba station 300 epochs give the lowest RMSE. For 200 epochs, the lowest RMSE values are obtained for the Parbhani, Kolhapur, and Nashik stations. Table 7 shows the epoch numbers and the approach for which the least RMSE values are obtained. The Colaba and Parbhani stations give their lowest RMSE values for the second approach, where 90% of the data is used for training; the rest of the stations give their lowest RMSE for the first approach, where 80% of the data is used for training.

CONCLUSION
The machine learning-based model used to predict the cloud cover in different regions of Maharashtra depends immensely on the data, because the data is used for learning the pattern. The correlation-based data imputation method proposed here identifies the most correlated and least correlated features. The direction of wind in the south (S) is the least correlated feature common to all 8 stations. The remaining features, apart from certain wind directions, are mostly correlated with each other for all the stations. The iterative imputer replaces the missing values by repeatedly iterating over the most correlated features, and the KNN imputer replaces the missing values in the least correlated features using their nearest neighbors. The imputed data is scaled and split for training and testing. The model is built by considering a two-way approach to information flow. The model is run for multiple epochs to obtain the least MSE, RMSE, and MAE values. The comparison is done for the 8 stations based on the two approaches used. It is observed that the model trained on 80% and tested on 20% of the data has the least MAE, RMSE, and MSE values for most of the stations, as compared to the model trained on 90% and tested on 10%.

Table 2 represents the MSE values of the different stations for different epochs considering the first approach, where the model is trained on 80% of the data and tested on 20%. It is observed that the model gives the lowest MSE at 500 epochs for Akola, Chikalthana, and Kolhapur. For Ratnagiri and Nagpur, 100 epochs give the lowest MSE. For Parbhani, Nashik, and Colaba, the MSE values are lowest at 200, 800, and 1,000 epochs respectively.

Table 3 represents the MSE values for the second approach, where the model is trained on 90% of the data and tested on 10%. The MSE values are lowest at 100 epochs for the Ratnagiri, Akola, and Nagpur stations. For the Chikalthana station 500 epochs and for the Colaba station 300 epochs give the lowest MSE. For 200 epochs, the Parbhani, Kolhapur, and Nashik stations give the lowest MSE values. Table 4 shows which approach gives the least MSE values and the corresponding epoch number. Except for the Parbhani and Colaba stations, all stations give better predictions for the first approach, where 80% of the data is used for training.

Table 7. Least RMSE values

In Table 8, 80% of the data is taken for training and the remaining 20% for testing. The lowest MAE values are obtained at 500 epochs for the Kolhapur and Nashik stations. For 100 epochs, Chikalthana, Parbhani, and Nagpur give the lowest MAE values. For 300 epochs, the Akola station gives the lowest MAE. The Colaba and Ratnagiri stations give the lowest MAE at 1,000 epochs. Table 9 shows the MAE values when 90% of the data is used for training and 10% for testing. MAE values are lowest at 200 epochs for the Parbhani and Kolhapur stations. For 100 epochs, the lowest MAE value is obtained for the Akola region. For 500 epochs, the Nashik, Colaba, and Nagpur regions have the lowest MAE. For the Ratnagiri and Chikalthana stations, the lowest MAE is obtained at 800 epochs. Table 10 shows the epoch numbers at which the least MAE values are obtained and the approach used. Except for the Akola station, all stations give lower MAE values when 80% of the data is used for training.