MetOp Satellites Data Processing for Air Pollution Monitoring in Morocco

ABSTRACT


INTRODUCTION
Over the last decade, the environmental threats posed by air pollution have increased significantly [1]-[4], and the effects of climate change have become numerous and wide-ranging [5]. There is no doubt that excessive levels of air pollution damage human and animal health as well as the wider environment. For these reasons, careful scientific research and monitoring of air pollutants is a necessity that must be carried out with a great deal of attention and precision.
Nowadays, as much as we want to quickly evaluate and draw conclusions from existing pollution and climate data, most of the problems we face center on preparing, cleaning, processing, and transforming the large amounts of raw environmental data we receive from satellites in near real time. In our case, the raw data comes in multiple primitive formats such as BUFR (Binary Universal Form for the Representation of meteorological data), GRIB 2, HRIT/LRIT, and HRPT/LRPT. In this paper, we present a system that processes BUFR-based binary files coming directly from the satellites' sensors and transforms them into a data set that is ready for data analysis tasks such as inference and visualisation.
The main source of the data we process is EUMETSAT. EUMETSAT is an intergovernmental operational satellite agency with a total of 30 European Member States. The organization's mission statement is to gather accurate and reliable satellite data on weather, climate and the environment around the clock, and to deliver them to its member and cooperating states, international partners, and to users world-wide [6].
The data we are most interested in comes directly from a family of satellites named Metop. Metop is a series of three polar-orbiting meteorological satellites. We currently get data from two of them, Metop-A and Metop-B, both of which are in a low polar orbit at an altitude of approximately 817 kilometres.
The system transforms the data from its primitive BUFR format, a binary data format maintained by the World Meteorological Organization, into comma-separated values (CSV) files. BUFR is a somewhat controversial and hard-to-work-with format because of the difficulty of manipulating and experimenting with its encoded values.
Our proposed solution is a software system composed of multiple stacked layers. The first layer decompresses the binary BUFR data, decodes it, structures and combines the decoded messages in CSV (comma-separated values) format, and finally normalizes the result. Because deep learning models are used in different climate-related problems [7]-[9], we trained an ANN-based architecture and measured its performance when filling missing value points and interpolating new ones. The system produces a near-continuous data stream on the 2-D surface of our area of interest.
The software solution proposed in this paper can be plugged directly into the endpoints of the near-real-time data stream. It allows fast experimentation with, and visualization of, already processed data points coming directly from the Metop satellite series. It also reduces storage space and processing time, since it focuses on areas of interest. We expect our solution to improve and accelerate the research done on top of the EUMETCast data stream pipeline.

PROCEDURE

2.1. Data Processing
The following figure demonstrates the procedure taken to pre-process and normalize the data.
In the first step, the system gets the raw tar files through the FTP protocol. Extracting the compressed files yields multiple binary BUFR files that follow a strict naming convention of the form (INSTRUMENT_ID-PRODUCT_TYPE-PROCESSING_LEVEL-SPACECRAFT_ID-SENSING_START-SENSING_END-PROCESSING_MODE-DISPOSITION_MODE-PROCESSING_TIME), which encodes several important variables such as the instrument identifier, orbit, and time frame. The system applies regular expressions to the names of the extracted files to keep only the pollution files (product code name "TRG") sensed while the satellite is scanning the area of interest. What we finally get are multiple BUFR pollution files corresponding to the area of interest that are ready to be decoded.
In the second step, the system uses a third-party software solution named BUFRExtract [10] to decode the BUFR files into bulks of exported messages. Each message is a text file containing a description of its columns and the values in each one.
In the third step, the system performs fast merge/selection techniques to combine all of the messages into two CSV files corresponding to the scanning time frame, one for the Metop-A satellite and one for Metop-B. Both CSV files contain the same columns of interest. After exporting the necessary values into multiple structured CSV files, the system groups rows by location point and exact date (Year-Month-Day-Hour-Minute-Second) and applies the mean function to the pollutant values, averaging any redundant measurements.
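The file selection and the averaging of redundant measurements described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's actual code: the example file name, the timestamp layout (`YYYYMMDDhhmmssZ`), the column names (`lat`, `lon`, `time`, `ch4`, `co2`, `n2o`), and the helper names `select_files` and `average_duplicates` are all hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime
from statistics import mean

# The nine dash-separated fields of the naming convention quoted in the text.
FIELDS = ("instrument_id", "product_type", "processing_level", "spacecraft_id",
          "sensing_start", "sensing_end", "processing_mode", "disposition_mode",
          "processing_time")
NAME_RE = re.compile(
    "^" + "-".join(rf"(?P<{f}>[^-.]+)" for f in FIELDS) + r"(?:\.\w+)?$"
)


def parse_time(stamp):
    # Assumed timestamp layout: YYYYMMDDhhmmss followed by an optional "Z".
    return datetime.strptime(stamp.rstrip("Z"), "%Y%m%d%H%M%S")


def select_files(names, window_start, window_end, product="TRG"):
    """Keep files of the given product whose sensing interval overlaps the window."""
    selected = []
    for name in names:
        m = NAME_RE.match(name)
        if m is None or m["product_type"] != product:
            continue
        start, end = parse_time(m["sensing_start"]), parse_time(m["sensing_end"])
        if start <= window_end and end >= window_start:
            selected.append(name)
    return selected


def average_duplicates(rows, keys=("lat", "lon", "time"),
                       pollutants=("ch4", "co2", "n2o")):
    """Group rows by location and timestamp, then average each pollutant."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return {key: {p: mean(r[p] for r in grp) for p in pollutants}
            for key, grp in groups.items()}
```

Matching on named groups rather than on fixed character offsets keeps the filter readable and makes it easy to add further conditions, for example on the spacecraft identifier.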
In the fourth step, the system removes data points that are substantively unreasonable, applying logical conditions based on Z-scores to the CH4, CO2, and N2O values. Lastly, the system normalizes all pollution values into the range [−1, 1] to accelerate convergence in the training phases, applying the same formula to all three numerical variables.
As a general description of the process: every half hour, the system receives one compressed tar file through the servers' endpoints. It automatically decompresses the file into binary BUFR files, selects the files corresponding to the area of interest, decodes them into messages using a third-party library (BUFRExtract), and turns them into two CSV files containing all of the values of interest in near real time. This results in a considerable reduction in the dimensionality of the data and in the space it normally occupies.
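The cleaning and normalization steps can be illustrated with the short sketch below. It rests on two assumptions that the text does not confirm: a Z-score cut-off of 3, and min-max scaling as the mapping into [−1, 1]. The function names are hypothetical.

```python
from statistics import mean, pstdev


def zscore_filter(values, threshold=3.0):
    """Drop values whose absolute Z-score exceeds the threshold (assumed: 3)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return list(values)  # all values identical, nothing to reject
    return [v for v in values if abs(v - mu) / sigma <= threshold]


def scale_to_unit_range(values):
    """Min-max scaling into [-1, 1]; an assumed stand-in for the paper's formula."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]
```

In a pipeline of this kind, the filter would run once per pollutant column before scaling, so that a single extreme reading does not compress the remaining values into a narrow band.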
The second part of the system fills the missing values in the 2-D surface of interest and also generates new data points using algorithmic search and a neural network architecture to get a near continuous data stream output that is ready for exploration, visualisation, and interpretation.

Intelligent Interpolation
The prediction of missing values is based on three pre-trained feed-forward, fully connected neural network models, fit to fill the missing values in the 2-D surface of interest for the three pollutants (CO2, CH4, and N2O). The general architecture of our ANNs is shown in Figure 2.
3. If the top missing point's average distance from all neighbor points is greater than 50 km, or if there is no top missing point, break the loop and finish the process.
4. If the average distance is less than 50 km, predict the missing point using the ANN models, mark the point as done, and loop back to step 2.
The system automatically loops over these steps until all missing values that can be predicted are filled, and it repeats the whole procedure for each of the three pollutants of interest.
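The loop described by these steps might look like the sketch below. Several details are assumptions made for illustration only: points are planar (x, y) coordinates in kilometres, the "top" missing point is taken to be the one with the smallest average distance to its k nearest known points, and `predict` is a hypothetical callable standing in for the trained ANN.

```python
import math


def fill_missing(known, missing, predict, k=3, max_avg_km=50.0):
    """Greedily fill missing points, best-supported first, until none qualifies.

    known:   dict mapping (x_km, y_km) -> pollutant value
    missing: iterable of (x_km, y_km) points without a value
    predict: callable taking the neighbour values and returning an estimate
    """
    known = dict(known)
    pending = list(missing)

    def avg_dist(p):
        nearest = sorted(math.dist(p, q) for q in known)[:k]
        return sum(nearest) / len(nearest)

    while pending and known:
        best = min(pending, key=avg_dist)       # the "top" missing point
        if avg_dist(best) > max_avg_km:         # step 3: too isolated, stop
            break
        neighbours = sorted(known, key=lambda q: math.dist(best, q))[:k]
        known[best] = predict([known[q] for q in neighbours])  # step 4
        pending.remove(best)                    # mark as done, loop again
    return known, pending
```

Because each filled point joins the set of known points, later predictions can lean on earlier ones, which is why the order in which points are processed matters.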

RESEARCH METHOD

3.1. Data Description
The first dataset used in this study was collected in the form of bulks of BUFR message files coming directly from two satellites, Metop-A and Metop-B, and precisely from the Infrared Atmospheric Sounding Interferometer (IASI) sensor, which is composed of a Fourier transform spectrometer and an associated Integrated Imaging Subsystem (IIS). The Fourier transform spectrometer provides high-resolution infrared spectra between 645 and 2760 cm⁻¹ (3.6 µm to 15.5 µm).
The main goal of IASI is to provide atmospheric emission spectra from which temperature and humidity profiles can be derived with high vertical resolution and accuracy. It is also used to determine trace gases such as ozone, nitrous oxide, and carbon dioxide, as well as land and sea surface temperature, emissivity, and cloud properties.
IASI measures in the infrared part of the electromagnetic spectrum at a horizontal resolution of 12 km over a swath width of about 2,200 km. With 14 orbits in a sun-synchronous mid-morning orbit (9:30 Local Solar Time equator crossing, descending node), global observations can be provided twice a day (every 12 hours). The satellites take around 25 minutes to scan the area of interest, and we get pollution data from points approximately 20 km apart.
We constructed the second dataset from already preprocessed data points, with the goal of training, testing, and validating our neural network models, and of solving the problem of filling missing data points and interpolating new points in the selected area of interest.

Intelligent Interpolation
We generated new empty points so that all points in the area of interest are spaced 5 km apart. The system then intelligently interpolates all of the empty points.

Data Collection
We collected 150 gigabytes of preprocessed data, the equivalent of around 800 million data points, to build an intelligent model capable of predicting missing pollutant values. After collecting the data set, we computed general statistics on missing data points. The results, based on the sampled dataset, are shown in Table 2.

Training and testing data
The data was transformed into a table in which the features are the 50 nearest points and the target variable is the value of the data point itself, and this table was used to train the artificial neural network. The distance between the target point and its furthest neighbor was capped at a maximum, and the same conditions used when selecting valid missing points were applied when transforming the data. When training the model to predict new point values (interpolation), the system adds new points (marked as missing) so that neighbouring points are 5 km apart. After creating the new grids of 2-D points, training sets were selected based on the availability of neighboring points.
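A rough sketch of how such a training table could be assembled is given below. It assumes planar coordinates in kilometres and uses the neighbour values, ordered from nearest to furthest, as the feature vector. The cap `max_km` and the function name are hypothetical, since the text does not state the maximum distance used or whether distances themselves are included as features.

```python
import math


def make_training_rows(points, k=50, max_km=50.0):
    """Build (features, target) pairs from known points.

    points: dict mapping (x_km, y_km) -> pollutant value
    A point is skipped when it has fewer than k neighbours or when its
    k-th nearest neighbour lies further away than max_km.
    """
    rows = []
    for p, value in points.items():
        others = sorted(
            (math.dist(p, q), v) for q, v in points.items() if q != p
        )
        nearest = others[:k]
        if len(nearest) < k or nearest[-1][0] > max_km:
            continue
        rows.append(([v for _, v in nearest], value))
    return rows
```

This brute-force search is quadratic in the number of points. At the data volumes reported here, a spatial index such as a k-d tree would be the more realistic choice.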

Training
Six models, each consisting of 3 fully connected hidden layers with 100, 50, and 25 neurons respectively, were used. The first 3 models were constructed to predict missing and corrupted values, and the last 3 were trained to interpolate new point values. The training details are as follows:
a. All neuron parameters were randomly initialized from the uniform distribution between −0.1 and +0.1.
b. Mini-batch gradient descent was used to optimize the parameters.
c. A learning rate of ε = 0.001 was chosen.
d. Training ran for 200 epochs with batches of 1024 samples.
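To make the architecture and initialization concrete, the following dependency-free sketch builds one such network and runs a forward pass. It is an illustration rather than the TensorFlow implementation used in the study. The input width of 50 (one value per nearest neighbour), the single linear output, and the ReLU activation are assumptions, as the text does not specify them.

```python
import random

# 50 inputs (assumed), hidden layers of 100, 50 and 25 neurons, 1 output (assumed).
LAYER_SIZES = [50, 100, 50, 25, 1]


def init_params(sizes=LAYER_SIZES, low=-0.1, high=0.1, seed=0):
    """Draw every weight and bias from the uniform distribution U(low, high)."""
    rng = random.Random(seed)
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(low, high) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.uniform(low, high) for _ in range(n_out)]
        params.append((weights, biases))
    return params


def forward(params, x):
    """Forward pass: ReLU on hidden layers (assumed), linear output."""
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]


def count_params(params):
    return sum(len(w) * len(w[0]) + len(b) for w, b in params)


def minibatches(n_samples, batch_size=1024):
    """Yield index ranges of at most batch_size samples, covering one epoch."""
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))
```

Under these assumptions each network has 11,451 trainable parameters, which is small compared with the hundreds of millions of available data points.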

Validation
To make the validation reliable, we used the 10-fold cross-validation technique, splitting the data set into multiple training and testing sets to verify the performance of the trained models and to detect overfitting.
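For reference, a k-fold split can be generated with a few lines of standard-library Python. This is a generic sketch of the technique, not the study's code, and the function name and the seed are arbitrary.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each sample is used for testing exactly once and for training in the other nine rounds, so the averaged score reflects performance on data the model has not seen.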

Interpolation Method
The system uses three pre-trained neural network models to predict the newly generated points and interpolate the whole surface. The process is similar to the procedure for predicting missing values. However, the system does not set a threshold on the average distance in order to break the prediction loop. It predicts and fills all new data points at a fixed neighbouring distance of 5 km, as shown in Figure 4. The system predicts all points and updates the sorted list of missing points as it goes, until all of the missing values are filled. The only difference from the previous model is that it has no criterion for deciding whether or not to predict a missing point.

Benchmarking
To measure the performance of our ANN-based interpolation system, we benchmark its predictions against two state-of-the-art algorithmic methods of spatial interpolation: kernel smoothing and kriging.

Kernel Smoothing
A kernel smoother is a statistical technique for estimating a real-valued function f(X) (X ∈ R^p) from its noisy observations when no parametric model for this function is known. The estimated function is smooth, and the level of smoothness is set by a single parameter. In mathematical terms, the idea of the nearest-neighbor smoother is the following: for each point X_i, take the N nearest neighbors and estimate the value of f(X_i) by averaging the values of these neighbors. This type of interpolation is most appropriate for low dimensions (p < 3); the curse of dimensionality [11] is one reason for that. The kernel smoother represents the set of irregular data points as a smooth line or surface, which in our case (a 2-D surface) is a perfectly reasonable solution. One way to fill these points is to use SciPy's [12] implementation of radial basis function interpolation (precisely, scipy.interpolate.Rbf), which is intended for the smoothing and interpolation of scattered data.
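As an illustration of the general idea, the sketch below implements a Gaussian-weighted kernel smoother (the Nadaraya-Watson estimator), in which the bandwidth is the single smoothness parameter. It is a generic example, not the SciPy radial basis function routine mentioned above, and the function name is hypothetical. The nearest-neighbor smoother corresponds to replacing the Gaussian weights with equal weights on the N closest points.

```python
import math


def kernel_smooth(query, points, values, bandwidth):
    """Estimate f(query) as a Gaussian-weighted average of observed values.

    points:    list of coordinate tuples with known values
    values:    observed value at each point
    bandwidth: kernel width, in the same units as the coordinates
    """
    weights = [
        math.exp(-0.5 * (math.dist(query, p) / bandwidth) ** 2) for p in points
    ]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query is too far from all points for this bandwidth")
    return sum(w * v for w, v in zip(weights, values)) / total
```

A large bandwidth averages over many points and gives a very smooth surface, while a small one follows the observations closely and reproduces more of their noise.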

Gaussian Process Regression or Kriging
Kriging or Gaussian process regression is a method of interpolation in which the interpolated values are modelled by a Gaussian process governed by prior covariances, as opposed to a piecewise-polynomial

Hardware
The deep neural network architecture, with hidden layers of 100, 50, and 25 neurons respectively, was implemented in Python, and Google's TensorFlow [13] library was used to build and train the models. The models were trained on a single NVIDIA Tesla K80 GPU device with 4992 CUDA cores, 24 GB of GDDR5 memory, and 480 GB/s of aggregate memory bandwidth.

RESULTS AND DISCUSSION
The resulting solution is a system composed of three layers of processes. The first layer decompresses, decodes, and normalizes the data. The second layer is a stack of three ANNs that fills in the missing pollutant values. The final layer is another stack of neural network models that interpolates new data points in our area of interest.

Data Processing
Decompressing, decoding, merging, cleaning, and normalizing the raw BUFR data results in a considerable reduction in resources. Our algorithm runs in linear time. Given the volume of data the system processes at each step, a simple computer configuration (4 gigabytes of RAM, 4 cores with no parallelism) yields the durations shown in Figure 5. These tests were conducted multiple times for each volume category to ensure high precision. We conclude that the system scales well and can process large volumes of data (up to terabytes per hour) in a relatively short time.

Intelligent Interpolation
The resulting surface of interest is a 1887 km by 1776 km rectangle, over which the system predicts a maximum of 123,568 points. Figures 6, 7, and 8 showcase examples of predictions on a fixed date using kriging, smoothing, and our neural network model. In these figures, the rounded markers represent a known sample of pollution data points, and the interpolated surface represents the resulting predictions. The training results obtained after cross-validating the models are shown in Figures 9 and 10. As expected, the system produces better results when filling missing values and generally worse results when filling in new data points. However, comparing the data interpolated by the 3 methods yields interesting results, and the following graph showcases these comparisons.

Discussion
As we can see from the results, the optimal interpolation technique is generally better than our trained neural network models. However, in the case of N2O and CH4, our model is competitive with the other two classical 2-D interpolation algorithms. Since 70% of our data was missing, there is room for better performance with greater volumes of data: if trained on larger volumes, our system could make better predictions and therefore become a competitor to the kriging and smoothing interpolation algorithms.

CONCLUSION
At the present time, the size, variety, and complexity of raw data are huge and continue to increase every day. The use of data processing systems to store, process, and analyze data streams has changed how we discover and visualise big data in general. In this paper, we presented a software solution composed of multiple stacked layers of subsystems that transform and process considerable volumes of raw pollution data in near real time, taking the data from its native compressed format to a structured, cleaned, normalized, and continuous data stream that is light and easy to experiment with.
In the future, significant challenges and problems concerning big environmental data must be addressed by industry and academia. Current work on topics ranging from the use of AI for plant monitoring [14] and social awareness of climate change [15,16] to the use of biological methods [17] to fight climate change is important. But new challenges remain in the field of environmental data science. Future work on new environmental data learning paradigms, scientific computing environments, and an all-around better infrastructure for pollution monitoring is a necessity for all of us.