A comprehensive insight towards Pre-processing Methodologies applied on GPS data

Reliable use of Global Positioning System (GPS) data demands a high degree of accuracy in the time and positional information required by the user. However, various extrinsic and intrinsic factors disrupt data transmission from the GPS satellite to the GPS receiver, which calls the trustworthiness of such data into question. This manuscript therefore offers a comprehensive insight into the data pre-processing methodologies developed and adopted by present-day researchers. The discussion covers standard data cleaning methods as well as a range of existing research-based approaches. The review finds that, despite the considerable amount of work addressing data cleaning, critical loopholes remain in almost all the existing studies. The paper identifies open research problems and uses use-cases as evidence that data cleaning methods still need critical investigation.


INTRODUCTION
The utilization of the Global Positioning System (GPS) has been increasing over the last decade, as it is one of the most cost-effective forms of navigational assistance [1]. With the proliferating usage of smartphones, various navigational applications and location services depend directly on GPS data. A GPS receiver extracts signal information from the satellites in order to obtain location-specific information. Depending on the GPS receiver used, the location information is generally provided in the form of longitude, latitude, and altitude [2]. An interesting property of the GPS signal is its public availability and accessibility. From a technical viewpoint, time and space are the backbone of GPS: each satellite carries an atomic clock with superior synchronization capability, and any drift in the clock timing is rectified and compensated with the ground devices almost immediately. Every GPS satellite continuously transmits a radio signal that carries updated positional data and the time information for that position. The signal latency depends on the distance between the satellite and the ground receiver; it does not depend on the speed of the satellite, since radio waves travel at a uniform velocity [3]. The ground receiver also carries out a computation of its own: it derives the positional information after obtaining data from multiple satellites, and this computation needs to be carried out with high accuracy. For the GPS receiver to compute its position, at least four GPS satellites should be within line of sight. This condition is quite hard to satisfy in many real-time cases.
The signal forwarded by a GPS satellite carries several essential pieces of information. The first is a code with pseudorandom characteristics, which is identified and interpreted only by the GPS receiver. The receiver can obtain the arrival epoch from this code on the basis of multiple parameters. The second is the message that carries the position of the satellite and the epoch of transmission. The receiver then computes the time of flight from these two parameters, i.e. the time of arrival and the time of transmission, and this computed information is what users require [4]. Note that the clock offset maintained within the receiver and the position of the receiver have to be computed together. Finally, the obtained information is converted to longitude, latitude, altitude, speed, etc., and forwarded to the navigational system of the user. Applications such as map updating and traffic monitoring use a GPS sensor to record the 3D coordinates X, Y, Z with a time stamp. Such records are characterized by their value and, as they accumulate, by their volume. If the recording period is very long, or if a system failure occurs, the data also raise the issue of veracity. Veracity, in the form of uncertain, missing, or redundant data, plays a crucial role in operating an accurate traffic management system. Large GPS traces therefore exhibit all three characteristics of volume, value, and veracity, along with velocity and variety. This poses critical challenges during data processing: a large volume of low-grade data, mixing raw data with uncertain elements, strongly affects the data analytics process, both from the data science viewpoint and when mining useful insight. Therefore, this paper reviews the existing approaches to GPS data pre-processing.
Section 2 discusses the essentials of GPS data, followed by the essentials of the GPS data pre-processing mechanism in Section 3. Section 4 discusses existing research work, followed by a briefing of open-end problems in Section 5. Finally, Section 6 presents the conclusion and future work directions.
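The time-of-flight step described above can be illustrated with a minimal sketch. It assumes the transmission and arrival epochs are already available in seconds; the function names and the idea of passing a known receiver clock offset are illustrative assumptions, not part of any cited receiver design (in practice the offset is estimated together with the position).

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def pseudorange(t_transmit: float, t_arrival: float) -> float:
    """Distance implied by the signal time of flight, in metres."""
    return C * (t_arrival - t_transmit)


def corrected_range(t_transmit: float, t_arrival: float,
                    receiver_clock_offset: float) -> float:
    """Same conversion after removing an (assumed known) receiver clock offset."""
    return C * ((t_arrival - receiver_clock_offset) - t_transmit)


# A flight time of 67 ms corresponds to roughly 20,000 km.
raw = pseudorange(0.0, 0.067)
# A receiver clock that runs 1 microsecond fast shifts the range by about 300 m.
fixed = corrected_range(0.0, 0.067, 1e-6)
print(raw / 1000.0, (raw - fixed))
```

The second call shows why the clock offset cannot be ignored: a timing error of one microsecond already translates into a range error of about 300 m.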

ESSENTIALS OF GPS DATA
Before discussing GPS data, it is essential to understand the fundamental structure of the system. It consists of a receiver, a ground control station, and satellites. At present, the GPS system transmits its signal on two significant frequencies: the first is 1,575.42 MHz and the second is 1,227.60 MHz. Widely used commercial applications rely on GPS signals encoded with the coarse/acquisition code, and this encoding involves identification codes for all the satellites. Special access is also given to military applications, where the GPS signal is encoded with the precise code [5]. Although a sophisticated process ensures that the data offered by GPS are accurate, various external factors still affect the accuracy (Figure 1), viz. i) the effect of the troposphere, which refracts the radio signal and induces errors, ii) the effect of the ionosphere, which slows signal propagation and causes errors in the transmission process, and iii) the effect of multipath transmission, where reflection from physical structures on the ground has an adverse effect [6]. At present, a standard measure called Dilution of Precision (DoP) is used to check how much the accuracy of GPS data has been degraded [7]. Normally, the DoP value is smaller when the GPS satellites are widely spread across the sky and higher when they are clustered together. Another essential parameter is signal strength, which also indicates the level of signal stability during reception. Normally, the GPS signal becomes unstable in the presence of obstacles or artifacts, which weaken it. The third factor in maintaining GPS data accuracy is the quantity of satellites with GPS capability: the more satellites are available, the more reliable the positioning values.
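As a rough illustration of how satellite geometry feeds into DoP, the sketch below computes the geometric DoP as the square root of the trace of (HᵀH)⁻¹, where each row of H holds the unit line-of-sight vector to one satellite plus a 1 for the clock term. This is the commonly used textbook definition rather than a formula taken from the cited works, and the two example constellations are invented for the comparison.

```python
import math


def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular geometry matrix")
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [row[n:] for row in a]


def gdop(az_el_deg):
    """Geometric DoP from satellite (azimuth, elevation) pairs given in degrees."""
    h = []
    for az, el in az_el_deg:
        az, el = math.radians(az), math.radians(el)
        # East, north, up components of the line of sight, plus the clock column.
        h.append([math.cos(el) * math.sin(az),
                  math.cos(el) * math.cos(az),
                  math.sin(el),
                  1.0])
    hth = [[sum(row[i] * row[j] for row in h) for j in range(4)] for i in range(4)]
    q = invert(hth)
    return math.sqrt(sum(q[i][i] for i in range(4)))


# Hypothetical constellations: three low satellites plus one overhead,
# versus four satellites bunched close to the zenith.
spread = [(0, 10), (120, 10), (240, 10), (0, 90)]
clustered = [(0, 70), (120, 70), (240, 70), (0, 90)]
print(gdop(spread), gdop(clustered))
```

Running the comparison shows the two geometries differing by more than an order of magnitude, which is why the same ranging error can produce very different position errors.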
Such external factors causing degradation of the GPS signal are publicly known; however, there are also various internal factors, which the research community has been investigating over the last decade. There are various forms and types of research-based solutions to this problem of removing artifacts from GPS data during pre-processing. The standard processes for removing artifacts from GPS data are classified into two types, viz. the statistical-based approach and the logical-based approach [8][9][10][11]. These methods were traditionally used to pre-process GPS data because they are less susceptible to errors stemming from sampling intervals. The density of data points in a GPS signal correlates with the probability that the vehicle movement is on-track or off-track. Because of this, a low-density data point is considered an outlier in GPS data. The majority of artifact problems in GPS data result from missing data, and the following are the standard procedures for dealing with the situation:
• Outlier removal: Traditionally, the data points of the GPS signal are first sorted in ascending or descending order by distance and median, and the consistent data are taken for further processing. A simple method called kernel density estimation is used to obtain the density of the data points, and low-density data points are considered redundant data. Other methods include adaptive density optimization and region-growing clustering with knowledge. Most of these methods fail to handle outlier removal in high-density data.
• Trajectory filtering: Trajectory filtering aims to improve the positional accuracy of GPS data. Approaches based on adaptive Kalman filtering and particle filtering have been developed to smooth the noise and thereby reduce the error in the values of the data points. These filters jointly handle position and speed, but that is a computationally complex task.
• System model for GPS data collection: The typical system model for GPS data collection includes N user or custom devices equipped with a GPS sensor, such that D = {N1, N2, ..., Ni}, where i = 1 to N. Each device in D records the data points of its GPS sensor and logs them in a local buffer, which is synchronized through an access point to the cloud for continuous streaming and updating of the data, and further to a numerical computing environment set up on an on-premise system. A typical system architecture for the generation, storage, and processing of GPS data is shown in Figure 2.
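The density-based outlier removal described above can be sketched as follows. This is a minimal illustration assuming planar coordinates in metres, a Gaussian kernel, and a cut-off expressed as a fraction of the median density; the bandwidth, the `ratio` parameter, and the function names are assumptions made for the example and are not taken from the cited methods.

```python
import math
import statistics


def kde_density(points, bandwidth):
    """Gaussian kernel density estimate at every point (O(n^2), fine for a sketch)."""
    norm = 1.0 / (len(points) * 2.0 * math.pi * bandwidth ** 2)
    densities = []
    for xi, yi in points:
        total = sum(
            math.exp(-((xi - xj) ** 2 + (yi - yj) ** 2) / (2.0 * bandwidth ** 2))
            for xj, yj in points
        )
        densities.append(norm * total)
    return densities


def remove_low_density(points, bandwidth, ratio):
    """Keep points whose density is at least `ratio` times the median density."""
    densities = kde_density(points, bandwidth)
    threshold = ratio * statistics.median(densities)
    return [p for p, d in zip(points, densities) if d >= threshold]


# Twenty fixes along a straight road plus one isolated fix 50 m off the road.
trace = [(float(x), 0.0) for x in range(20)] + [(10.0, 50.0)]
cleaned = remove_low_density(trace, bandwidth=2.0, ratio=0.5)
print(len(trace), len(cleaned))
```

A relative threshold is used here only so that the example does not depend on absolute density units; as the text notes, a density criterion of this kind becomes unreliable when erroneous points are themselves densely packed.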

PRE-PROCESSING GPS DATA
According to the existing research studies, all the processes associated with artifact removal from GPS data use time-series analysis and are broadly classified into two classes: i) the statistical-based approach and ii) the logical-based approach. The research on each approach is discussed as follows. i) Statistical-based approach: The quantitative or statistical method is considered one of the effective approaches for identifying the best item sets and for cleaning datasets so that they are statistically closest to a user-specified data set [12]. Usually, GPS data pre-processing follows two significant phases: i) error detection and ii) error repairing [13].

Qualitative error/anomaly detection
This form of detection deals with exploring statistical errors as follows:
• Error type: This relates to identifying the type of error and selecting the appropriate method to describe the patterns of legal data instances. Examples are integrity constraints, first-order logic by the fractional method, functional dependencies, and denial constraints.
• Automation: This clarifies how users are involved in the error detection method. Examples are the detection of functional dependencies and the tracing of all replicated data entries [14].
• Business intelligence: Artifacts are quite likely to occur in the BI stack, as error-prone data are usually transmitted through communication channels with data processing capability. Meanwhile, the majority of strategies deal with tracing artifacts in the data of the actual database.
The statistical artifact tracing taxonomy is shown in Figure 3.

Artifacts repairing method
In this mechanism, various data instances are identified in order to ascertain the essential quality demands of the dataset. Like the error detection method, this method addresses three significant questions: what, how, and where to repair. The error repairing method contains three phases, as shown in Figure 4, viz. i) repair target, ii) automation, and iii) repair model.
• Repair target: This process makes different assumptions about the data and the quality rules, e.g. trusting the declared integrity constraints, trusting the complete data, allowing constraint relaxation, or exploring possible changes to both data and constraints. Most approaches rectify the data with respect to a set of artifacts, while some approaches also treat the communication medium as a root cause of errors.
• Automation: Error repairing techniques are classified according to the user's involvement (i.e., where and how humans are involved). Some existing techniques are fully automated (e.g., database recovery). Other techniques involve human interaction during the repairing process, either to verify the repaired work or to train the system so that it can make repair decisions automatically [15].
• Repair model: Existing methods repair the database in situ, which destructively modifies the database. Queries can also be answered by the repair model by sampling over the various possible repairs, alongside probabilistic approaches [16].
Some popular error repairing methods are discussed in [17][18][19][20][21][22]. With increasing analytical complexity, it is essential to know the statistical implications of data pre-processing. Multiple techniques exist that enhance the accuracy or efficiency of data pre-processing through a statistical approach, e.g. machine learning methods. Some popular data pre-processing algorithms are discussed as follows.
Active learning for crowdsourcing is slowly increasing in popularity. Crowdsourcing is being rapidly adopted in business fields for data pre-processing [23]. In the educational sector, there is an increasing need to address this complexity problem, and multiple recent research studies employ an active-learning approach to solve crowd queries [24][25][26]. Supervised learning methods (i.e., Support Vector Machine and Random Forest) are among the most important methods for incorporating user input into data pre-processing, and active learning is an algorithmic approach that selects the most informative data to acquire.
Several statistical data pre-processing approaches have been presented in the existing research literature to clean data repositories more precisely and accurately. For example, the well-known project "Eracer" depicts how the core process of pre-processing noisy data can offer two stages of learning operation. Well-known graph-based methods are used to represent message passing and relational algorithms that resolve inconsistencies [27]. Additionally, several recent approaches address statistical outlier detection [28,29]. In [30], the authors employ a machine learning approach to improve the reliability of data pre-processing.
The extended work on sophisticated data pre-processing, in which data cleaning precedes the training of a machine learning model, is called ActiveClean [31]. This approach employs a method for selecting the most significant data, together with methods for rapidly updating the machine learning model given newly cleaned data. According to the study in [13], a specific number of records in the considered dataset is subjected to the cleaning process, while the remaining data are subjected to training. GPS trajectories or sequences of sensor readings often contain imprecise or error-prone values. Even business databases can be error-prone [32]. The existing approach to sequential data pre-processing considers speed constraints, which are linked with fuel consumption [33]. Errors associated with large spikes can be detected by such constraints, and constraint-based pre-processing repairs the dirty values with respect to minimum/maximum speeds. However, speed constraints are not successful in detecting errors that remain within the practical speed limits. For a better investigation, it is essential to consider small errors as well. One small example is a deviation of 1 m in the GPS readings. Moreover, when a massive number of such errors is aggregated, mining results can be seriously misled; for example, clusters cannot be formed from inaccurate GPS readings with multiple small errors [34].
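The behaviour of speed constraints discussed above can be sketched as follows. The example assumes one-dimensional along-track positions in metres, timestamps in seconds, and a simple greedy left-to-right repair; the greedy clamp is a deliberately simplified stand-in chosen for illustration, not the repair algorithm of the cited studies, and all names are hypothetical.

```python
def speed_violations(times, positions, s_min, s_max):
    """Return indices i where the speed from point i-1 to point i leaves [s_min, s_max]."""
    bad = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        speed = (positions[i] - positions[i - 1]) / dt
        if speed < s_min or speed > s_max:
            bad.append(i)
    return bad


def clamp_repair(times, positions, s_min, s_max):
    """Greedy repair: pull each value into the range reachable from its repaired predecessor."""
    repaired = [positions[0]]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        lo = repaired[-1] + s_min * dt
        hi = repaired[-1] + s_max * dt
        repaired.append(min(max(positions[i], lo), hi))
    return repaired


times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
spiky = [0.0, 10.0, 20.0, 130.0, 41.0, 50.0]   # large spike at index 3
subtle = [0.0, 10.0, 21.0, 30.0, 40.0, 50.0]   # 1 m error at index 2

print(speed_violations(times, spiky, 0.0, 30.0))   # spike and its rebound are flagged
print(speed_violations(times, subtle, 0.0, 30.0))  # the small error passes unnoticed
print(clamp_repair(times, spiky, 0.0, 30.0))
```

The spike is caught because it implies an impossible speed, whereas the 1 m deviation stays well inside the allowed speed range and is therefore invisible to the constraint. The greedy repair also shows a known weakness of naive clamping: one bad value can drag the following repaired values with it.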
Furthermore, noise is usually associated with raw GPS data [35], and it introduces uncertainty into the results, which is undesirable to researchers and industrial engineers in general. The evaluation process defines how reliable a dataset is, which includes GPS error detection and missing data. This evaluation also covers sample size, rate, spatial coverage, and the existence of additional data types (e.g., weather). Vitor et al. [36] investigated the limitations of prior work on data quality indicators (i.e., for floating car data). The authors leveraged a number of statistical indicators, including reliability, accuracy, and city spatial coverage, to evaluate specific aspects of data quality. The statistical indicators rely on a sequence of statistics, clustering, and external data elements such as road maps.
• Yuki-San method: This approach is used for establishing various statistical indicators, which are basically of two types, viz. i) value, which represents the quality of the data, and ii) veracity, which is mainly associated with the reliability of the data at its source. Value is represented in terms of granularity and coverage factors: micro-temporal coverage (analysis of the daytime temporal coverage) and spatial coverage (provision of real-time spatial information). Veracity is enumerated as missing data (any signal gaps in the dataset), reliability (a measure of logical precision), and accuracy (the spatial precision of GPS devices).
• Indicator of spatial coverage: This term is associated with measuring the distance-based diversity of the vehicle data. Its value usually increases with the density of GPS traces.
The entire process of computing spatial coverage is given in algorithmic form in [36], where the set of GPS traces over a defined grid cell (Sgc) is weighted based on its relevance to obtain the spatial coverage indicator.
• Missing data: The missing data indicator in [36] is expressed in terms of RFC, a complementary Gaussian error function; P, the number of packets lost; and G, the granularity.
• Reliability: Reliability covers the objectivity of the dataset, and it is computed in [36] from the awake trace ratio (at), the awake trip ratio (aT), the reachable trace ratio (rt), and the reachable trip ratio (rT).
• Accuracy: Accuracy is measured by the inconsistency between the positions reported by the GPS device and the true location of the vehicle. The authors formalized the accuracy indicator in algorithmic form, and the resulting indicator Acc is defined through median(eT), where eT represents the error of each trip T.
The Yuki-San method was evaluated on data aggregated from four-wheelers in San Francisco and Nanjing. From the obtained results, the authors concluded that the proposed Yuki-San method has strong potential to uncover the value in floating car data sources in an automated manner.
• YouSense tool: It is a monitoring tool which collects raw GPS data via a mobile application.
It tracks the position with GPS, Wi-Fi, and accelerometers. The advantage of YouSense lies in data pre-processing and data analysis. In [37], the authors investigate multiple filter criteria for pre-processing YouSense GPS data through statistical analysis of the datasets of different persons. YouSense collects GPS data records and displays them according to the time stamp of the GPS chip; the corresponding parameters are time in milliseconds, longitude, latitude, accuracy, altitude, speed, and bearing. Although the collected GPS data records provide high-accuracy position data, the data contain gaps (i.e., missing data errors). These data gaps may be planned gaps (e.g., the phone is not in operational mode, or the GPS device is switched off) or unplanned gaps (e.g., the phone battery is dead, or the GPS device is unable to receive signals). Hence, to resolve such data gaps, the dataset needs to be cleaned by i) filtering out wrong location information, and ii) filling the gaps that arise while the GPS device is switched off.
To understand the raw GPS data (Figure 5), the authors used the Quantum Geographic Information System (QGIS) tool, which visualizes the GPS data (i.e., GPS viewing, editing, and analysis) and also supports web map services. To repair sequential GPS trajectory data, consider a sequence x = x[1], x[2], ..., where x[i] is the ith data point over a finite domain. Each x[i] is associated with a specific timestamp ti as well as with artifacts within a certain predefined range θi. The range θi may differ between different forms of data, which affects the accuracy score of the GPS readings. It is also possible to set a maximum value of θi that represents the highest possible artifact for all forms of sequenced dataset.
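The gap handling described above can be sketched as follows. The record layout (dictionaries with `time_ms`, `lat`, and `lon` keys), the 1.5 s gap threshold, and the linear fill are assumptions made for this illustration; they mirror the parameter names listed in the text but are not the YouSense implementation.

```python
def find_gaps(records, max_interval_ms):
    """Return (start_index, end_index) pairs where consecutive fixes are too far apart in time."""
    gaps = []
    for i in range(1, len(records)):
        if records[i]["time_ms"] - records[i - 1]["time_ms"] > max_interval_ms:
            gaps.append((i - 1, i))
    return gaps


def fill_gap_linear(before, after, step_ms):
    """Linearly interpolate positions between two fixes; filled points are flagged."""
    filled = []
    span = after["time_ms"] - before["time_ms"]
    t = before["time_ms"] + step_ms
    while t < after["time_ms"]:
        w = (t - before["time_ms"]) / span
        filled.append({
            "time_ms": t,
            "lat": before["lat"] + w * (after["lat"] - before["lat"]),
            "lon": before["lon"] + w * (after["lon"] - before["lon"]),
            "interpolated": True,  # keep synthetic points distinguishable from real fixes
        })
        t += step_ms
    return filled


# Hypothetical trace sampled once per second with a five-second hole.
records = [
    {"time_ms": 0, "lat": 12.9700, "lon": 77.5900},
    {"time_ms": 1000, "lat": 12.9701, "lon": 77.5901},
    {"time_ms": 2000, "lat": 12.9702, "lon": 77.5902},
    {"time_ms": 7000, "lat": 12.9707, "lon": 77.5907},
    {"time_ms": 8000, "lat": 12.9708, "lon": 77.5908},
]
gaps = find_gaps(records, max_interval_ms=1500)
start, end = gaps[0]
filled = fill_gap_linear(records[start], records[end], step_ms=1000)
print(gaps, len(filled))
```

A linear fill is only a placeholder: over a long gap it produces exactly the kind of straight-line route that the use-cases later in the paper identify as misleading, so the `interpolated` flag is kept to let downstream analysis treat such points with caution.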

Figure 5. Visualization of raw GPS data with multiple gaps in the GPS trace
Equation (1) gives the probability P(x), also known as the likelihood, of a sequence x with respect to its speed changes. Q(ui) denotes the probability of the speed change ui, and P(ui) represents the corresponding (log) likelihood, where the empirical probability distribution Q over speed changes can be determined using simple statistics over the same sequence. The authors formulated the repair of sequential data as a computationally hard problem [38] and evaluated it on practical GPS data collected with a smartphone while the subject moved through the observation area. The presented study considered a comprehensive test environment that included errors. However, the only parameter to be identified is δ, which is connected with the cost of repairing the data, as shown in Table 1. GPS trajectory data analysis is a trending research topic, mainly used for detecting the mode of transportation. Various properties are associated with determining the mode of transportation (e.g. speed, latitude, location, longitude, acceleration, etc.). Unfortunately, the aggregated GPS data do not include any characteristics of the mode of transportation. The study in [33] presented a discussion of the entropy factor PE in relation to mobility. A classifier based on an extreme learning machine is developed to minimize the training time without compromising accuracy.
• Permutation entropy: This mechanism is used for identifying dynamic changes in the complexity of a time series. The variable PE associated with the original time series basically represents a Shannon entropy over all K symbols.
Its mathematical representation is
$$PE(m) = -\sum_{j=1}^{K} P_j \log P_j \qquad (2)$$
where m represents the embedding dimension and Pj represents the probability distribution over the K distinct symbols of the series.
• Extreme Learning Machine (ELM): It is a form of machine learning that uses a single hidden layer within a conventional feed-forward training mechanism. Training with this approach is considerably faster than with legacy neural-network-based machine learning. The experimental analysis of this approach is as follows. The authors considered the "Microsoft GeoLife dataset", which includes 17621 moving trajectories of 182 users over 3 years. These trajectories were recorded by different GPS loggers and GPS phones. The authors extracted features from each trajectory and categorized them into basic features (average velocity, velocity variance) and sophisticated features (e.g. the PE of velocity). The outcomes of training and testing with these features are shown in Table 2.
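The permutation entropy used as a feature above can be computed with a short sketch. It follows the standard ordinal-pattern definition (Shannon entropy of the relative frequencies of order patterns of length m); the `delay` parameter and the optional normalization by log(m!) are common conventions added here as assumptions and are not stated in the reviewed study.

```python
import math
from collections import Counter


def permutation_entropy(series, m=3, delay=1, normalize=True):
    """Shannon entropy of the ordinal patterns of length m found in a time series."""
    n = len(series) - (m - 1) * delay
    if n <= 0:
        raise ValueError("series too short for the chosen m and delay")
    patterns = Counter()
    for i in range(n):
        window = [series[i + k * delay] for k in range(m)]
        # The ordinal pattern is the index order that sorts the window.
        patterns[tuple(sorted(range(m), key=window.__getitem__))] += 1
    h = sum(-(c / n) * math.log(c / n) for c in patterns.values())
    return h / math.log(math.factorial(m)) if normalize else h


steady = [float(v) for v in range(10)]        # steadily increasing velocity
jittery = [1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]  # alternating velocity

print(permutation_entropy(steady, m=3))
print(permutation_entropy(jittery, m=2))
```

A steadily changing velocity profile yields a single ordinal pattern and therefore zero entropy, while a profile that alternates between patterns yields a higher value; this contrast is what makes the measure usable as a feature for distinguishing movement behaviour.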

EXISTING RESEARCH TRENDS
Apart from the standard methodology of GPS data pre-processing, there are various research contributions towards addressing data cleaning problems. The existing studies are broadly reported to adopt four different approaches: i) the statistical-based approach, ii) the logical approach, iii) the outlier-detection approach, and iv) the trajectory-based approach. The statistical-based approaches emphasize time series, prediction, trip detection, quantitative patterns, and machine learning [39][40][41][42][43][44][45][46][47][48]. The existing logical-based approaches are reported to consider velocity constraints, reduction of travel distance, and human navigational systems [49][50][51]. Nearly similar problems are also considered in outlier-detection-based approaches, where driving behavior, statistical process control, and partitioning are taken into account [55]. Trajectory-based approaches are found to use security factors, congestion analysis, clustering, mining, map updating, and similarity assessment [56][57][58][59][60][61][62][63]. Table 3 summarizes the present-day research contributions with respect to different parameters, showing that every approach comes with advantages as well as significant limitations.

OPEN END PROBLEMS
From the discussion in the prior sections, it can be seen that there are various standard and unique approaches for addressing data cleaning problems in GPS signals. However, it can also be seen that the majority of researchers have paid little attention to the problems associated with signal lapse in GPS data. The prime reason is the use of standard datasets, which do not exhibit these problems. Generally, information about such signal lapse can be obtained from a GPS device that receives signals from multiple GPS satellites. Such dynamic data cannot be obtained from a standard dataset, as they directly represent consistent interruptions in GPS data with respect to time. Hence, a significant problem is skipped when attempting the GPS data cleaning process. Considering this problem is of high importance, as it is highly practical and inevitable owing to the presence of different forms of obstruction on the earth's surface, e.g. trees and tall buildings. A closer look at the existing approaches shows that various methods indirectly attempt to solve this problem with the aid of time-series analysis while skipping the lapse factor. Recent works are not found to have any such consideration. However, the works carried out by Wheeler et al. [64] and Lachowycz et al. [65] take a unique approach in which the authors use the raw GPS data in order to check the lapse factor. This implementation permits various other forms of time-series data to be aggregated while investigating the lapse factor, by retaining contextual spatial data as well as data obtained from accelerometers. However, this approach is only valid for outdoor applications and not for indoor applications, resulting in missing data when an indoor application is considered.
In the same year, 2010, Oliver and Badland [66] carried out a study that ignored participant-based information and thereby failed to meet their critical factor. The next research methodology attempted for missing data was the imputation technique of Troped et al. [67]. Despite slight differences, all these approaches share the use of spatial and temporal data; however, all of them seriously lack any form of computational modeling for validating or benchmarking the presented approaches to dealing with missing data in the GPS signal. The researchers working on standard datasets also ignored the fact that there is always a certain amount of error even in standard GPS data, as such data are never claimed to account for any environmental factors. If such practical parameters are not considered in the dataset, then there is always a fair chance of error degrading the accuracy of the analysis. Various use cases show that missing data can significantly degrade the data quality of the GPS signal.
• Use-Case-1: The first use case is very common and is termed the drifting problem; it is highly inevitable and results in missing data. Figure 6 highlights the GPS traces of a dense forest area, where tracking accuracy is high on the road but the trace starts showing random positions once it enters the forest. Hence, the positioning data in the forest area are missing, and there is no existing approach that addresses this missing data problem.
• Use-Case-2: This is another frequently encountered problem in GPS signal reception, characterized by signal attenuation. Figure 7 shows a straight line within the circle, which is a false route in the terrain region. In such a case, a straight line is drawn between the source and destination points, which is highly inaccurate and demonstrates the complete loss of data. None of the existing research work has emphasized this problem of missing data to date.
• Use-Case-3: This problem is encountered more often in urban areas and much less in rural areas, and it results in the bouncing of the GPS signal. Figure 8 highlights three locations where a scattered GPS signal is received owing to the presence of tall buildings. The navigation system shows separate tracks even on a straight road, or vice versa, as it is incapable of tracing the original signals.
Unfortunately, such problems also contribute directly to missing data, for which no effective solution is found in the existing studies. From all this evidence, it is quite clear that there is a critical need for a reliable GPS service, where the solution cannot address the external parameters but should focus more on the internal parameters.

CONCLUSION AND FUTURE WORK
This paper has presented a review of existing approaches to GPS signal data and its quality. From the quality viewpoint, the paper highlights that pre-processing of the GPS signal is one of the essential operations. Existing approaches to data cleaning are found to adopt many sophisticated and complicated techniques and have produced diverse results. However, their results are achieved using specific error-free datasets and controlled research environments. The use of such data should itself be reconsidered, because without the errors discussed in the use-cases of the prior section, it is not possible to address the missing data problem in the GPS signal.
Therefore, future work should be directed towards adopting datasets that exhibit such error characteristics. As such datasets are not available, future work will begin with computational modeling for error incorporation, such that the missing data map to the use cases illustrated in the prior section. As part of the solution, future work will then focus on offering correlated data that have a higher probability of matching the missing data. The work will be carried out using a combination of a statistical approach, probability theory, and time-series analysis in order to develop a new computational model. The performance of such a model will be assessed using benchmarked practices as well as comparative analysis with extensive numerical case studies.