Anomalies detection for smart-home energy forecasting using moving average

ABSTRACT


INTRODUCTION
Smart homes are among the first application areas of the internet of things (IoT). A large amount of data is collected from the home energy management system (HEMS), which raises different challenges at different stages of data analysis. Accurate forecasting is required not only in the smart home field but also in many others, such as weather forecasting [1], patient number forecasting [2], marketing research forecasting [3], mortality rate forecasting [4], rainfall forecasting [5], and more. The data can be collected, pre-processed, analyzed, and monitored using predictive analysis (PA), and advanced intelligent technologies can help convert these data into reports, charts, and graphs.

In 2019, it was reported in [6] that Malaysia's Tenaga Nasional Berhad (TNB) raided many properties suspected of stealing electricity for bitcoin mining operations, which resulted in a loss of $25 million for the utility company. These problems are not limited to Malaysia; they are also encountered in other countries, such as Iran, Argentina, Brazil, Venezuela, Turkey, a few other European countries, and the eastern powerhouse Russia.

Int J Elec & Comp Eng, ISSN: 2088-8708. Anomalies detection for smart-home energy forecasting using … (Jesmeen Mohd Zebaral Hoque)

As stated in [7], "data quality (DQ) is generally described as the capability of data to satisfy stated and implied needs when used under specified conditions". To maintain DQ, it is essential to address its most common dimensions (i.e., accurate data, complete data, and consistent data) [8], [9]; further dimensions include consistent representation, accessibility, timeliness, and relevancy [10]. Various sources have reported losses of billions of dollars due to poor DQ [11], [12]. Low DQ can be caused by wrong or missing data, and it is essential to handle this type of dataset [13], as it may lead to incorrect or misleading decisions, predictions, or instructions. As stated in [14], dirty data can slow down any processing that depends on data analysis (DA) and can even affect the total cost for the organization, which can exceed billions of dollars per year. It was also mentioned that around 60% of the data in an organization contains data issues; hence, organizations are now concerned about dirty data.
Leonardi et al. [15] state that 'PeerEnergyCloud' covered use cases related to the application of smart home technology in different fields. The work indicates that various applications respond differently to different DQ aspects and to varying degrees. It addressed the most common DQ aspects faced in smart home energy monitoring systems: data accuracy, data completeness, and data delay. However, the authors did not propose concrete solutions to these issues, which still need to be overcome. Moreover, the study in [16] stated the importance of an anomaly detection system for electrical load datasets in a residential community. Detecting abnormal behaviors in real time is essential for smart home system users and for TNB. Anomaly detection helps to point out outlier data for closer examination and to focus attention on anomalous meters. Accurate load forecasting has become a significant part of planning and operation for all active participants in the electricity market; it enables effective outcomes in HEMS and provides precise prediction and healthy real-time power control [17], [18]. With the new market structure in place, the penalty costs for under- or over-contracting electricity have increased significantly, making the minimization of costs and revenue losses more critical than ever [19]. To achieve the goal of an efficient system, anomaly detection models based on a technical analysis tool are proposed. These detection models are implemented for, and adapted to, the ever-growing use of IoT devices in time series forecasting cases. Because the output is computed from past energy usage, MA is known as a trend-following, or lagging, indicator.
The first contribution is an application programming interface (API) for handling duplicate features and data, deployed in a smart home environment. The main contribution is modelling the unusual behaviors of this smart home environment using moving average (MA) models, which can be utilized for detecting abnormal behaviors. Previous studies detected anomalies in the energy usage of a household or an apartment building. However, they did not validate their systems for time series forecasting of energy usage. In response, a system was developed for time series anomaly detection that performs the other pre-processing steps, eliminates anomalies without relying on a specific monitoring system or tool, and modularizes the anomaly detection method into a small number of well-defined modules. Here, anomalies in energy consumption are detected and removed before the predictive model is applied. Finally, the system was evaluated by applying the time series forecasting model auto regressive integrated moving average (ARIMA) to the specified HEMS scenario; the approach helped reduce the forecasting mean squared error (MSE) from 0.179 to 0.066.

LITERATURE REVIEW
Most researchers focused on anomaly detection to find abnormal data that can threaten the management system. Vikhorev et al. [20] developed a model for building management that detects anomalies causing extreme energy usage and a probable increase in carbon taxes. Sial et al. [21] presented four heuristics to identify abnormal energy consumption using a contextual grouping of smart meters. They grouped meters with a similar context of energy usage within a neighborhood by using the distance from K-nearest-neighbor days. However, the K-nearest neighbor model requires labelled datasets, which are challenging to obtain from different households for the study presented in this paper. Another system proposed an anomaly detector for smart home security using a hidden Markov model (HMM) and achieved good accuracy in classifying potential anomalies that indicate attacks [22]. An HMM can be implemented with either supervised or unsupervised learning; however, the authors of [22] also implemented it with supervised learning.
Anomaly detection can be beneficial in different respects. For example, [23] identified anomalies in consumer behavior over streaming data, since an anomaly can occur when a customer profile changes (known as concept drift) in circumstances related to theft, scams, or damage. Andrysiak et al. [24] focused on detecting anomalies in the last-mile radio frequency (RF) smart grid communication network to uncover energy theft and customers who, deliberately or unconsciously, use energy in ways that disturb the system. A similar work [25] proposes an anomaly detection system that detects theft through communication in a simple process rather than using a central point or electricity meters.

ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 12, No. 6, December 2022: 5808-5820

Cui and Wang [26] studied the electricity consumption data of a school and tested five models to detect anomalies. They proposed a hybrid model that combines polynomial regression and a Gaussian distribution to identify anomalies in the datasets of a facilities management company. They concluded that the proposed prototype could also be applied to other types of time series datasets. However, a good training dataset is required to train the model before the actual detection process. In contrast, MA is a statistical model that does not require any training.

Most researchers attributed data anomalies to energy theft in low-voltage networks. The traditional technique for detecting this type of anomaly is to look for irregularities in the customer billing information. This centralized method is slow because it requires manually scanning long-term historical usage data. Zidi et al. [27] indicate the use of various sophisticated methods for theft detection. Another study predicted daily real-time usage using a hybrid neural network integrated with an ARIMA model; it also detected anomalies by finding the differences between actual and predicted usage using the two-sigma rule [28]. Yu et al. [29] applied K-nearest-neighbor with a sliding window and found it a valuable tool for identifying anomalies in hydrological time series; it improves DQ and helps to make better [29]. The study also concludes that it is crucial to select the best combination of sliding window size and confidence coefficient for the specific use case and time series input. Hence, it is essential to select the best parameters to overcome the fitting issue.
Since time-series data may contain abnormal data, it is essential to remove these data for better forecasting. MA with a sliding window is the most straightforward technique used for anomaly detection. However, it is essential to tune its parameters to obtain an optimal detection system. As the works cited above show, many researchers detected anomalies, but they did not evaluate whether eliminating the anomalies improves time-series forecasting. In this study, ARIMA was selected for testing the enhancement of time-series forecasting, as ARIMA is a powerful technique in which the series' own history is used as the explanatory variable. Due to its univariate modeling nature, however, ARIMA cannot exploit leading indicators or descriptive variables.

METHOD: DATA ANALYTICS PROCESS WITH DATA QUALITY
Based on HEMS, a DA model was designed and developed to analyze a smart home energy monitoring system. The DA process was implemented in the steps presented in Figure 1. Data issues usually arise at different stages of DA. For instance, in the collection stage, the pre-processing and cleaning stage, and the analysis stage, the issues to be handled are integrity, incomplete and duplicate data, and inaccurate data, respectively. Hence, the primary data issues must be addressed at every stage of DA.

Data collection
The data collection stage plays a significant role in obtaining accurate results. Collecting data from a reliable source is very important to ensure that the data reflects the real world. Collected data typically contains data issues, so it is essential to examine the dataset in order to obtain accurate results. The data was collected from the UMass Trace Repository, which contains network, storage, and other traces. In this research, the electrical dataset (aggregated and individual circuits) from 'HomeC' was used as input. This dataset is helpful for validating the proposed approach for detecting and eliminating anomalies in time series.

Data pre-processing and cleaning
Data combined from heterogeneous sources can contain inconsistencies and missing values. It is very important to clean the data before analyzing it [30]. Before data cleaning, a few steps of data processing are carried out, as presented in Algorithm 1.

Data analysis and modelling
Predictive modelling is implemented entirely as Python scripts. The time series in Figure 2 presents the total cost data for a particular user on a daily basis (month on the horizontal axis and cost on the vertical axis). In this time series, anomalies were detected for the one-month period from 2 January 2016 to 1 February 2016.
For forecasting, however, a 70/30 training-to-testing ratio is selected, as it is one of the ideal ratios for splitting a dataset into training and testing sets [31]. According to [32], using 50% to 70% of the data for training is more likely to yield a good model; accordingly, 70% of the data is used to train the ARIMA method and the remaining 30% to test it. Present-day data is estimated from past data using a moving average technique. MA is categorized into simple moving average (SMA) and exponential moving average (EMA). Unlike SMA, EMA gives more weight to the latest data. In this experiment, MA is executed on a weekday basis for both SMA and EMA: the records from previous matching weekdays are fed into the system to estimate the current value. For example, to estimate the present Monday's value, records from the last Monday are used. MA can thus identify and filter out abnormal short-term fluctuations and smooth the outcome. With the calculated values, it is easy to detect and monitor anomalies. These anomalies are one of the DQ issues [30] and occur due to unusual behaviors such as incorrect meter data or energy theft. Detecting them helps minimize the consequences of a variety of attacks from inside and outside the structure. The process of handling anomalies is presented in Algorithm 2. Here, the mean absolute error (MAE) function (output presented in section 4.2) is used to calculate the upper bound and lower bound around the rolling mean.
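The detection step described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 2 (which is not reproduced in this text): the function names, the `scale` parameter, and the bound rule (rolling mean plus or minus `scale` times the MAE) are assumptions, and the estimate for each point is taken from the preceding `window` values only.

```python
from statistics import mean


def simple_moving_average(values, window):
    """Estimate each value as the mean of the `window` values before it.

    Positions without enough history get None (no estimate yet).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(values[i - window:i]) if i >= window else None
        for i in range(len(values))
    ]


def detect_anomalies(values, window, scale=2.0):
    """Return (anomaly indices, MAE) using rolling mean +/- scale * MAE.

    The bound rule is an assumption made for illustration.
    """
    estimates = simple_moving_average(values, window)
    # Keep only the positions that have a rolling-mean estimate.
    scored = [
        (i, v, e)
        for i, (v, e) in enumerate(zip(values, estimates))
        if e is not None
    ]
    if not scored:
        raise ValueError("window must be smaller than the series length")
    # MAE between the observed values and their rolling-mean estimates.
    mae = mean(abs(v - e) for _, v, e in scored)
    band = scale * mae  # assumed half-width of the upper/lower bounds
    anomalies = [i for i, v, e in scored if abs(v - e) > band]
    return anomalies, mae


# Hypothetical hourly usage values with one spike at index 6.
usage = [1.0, 1.1, 0.9, 1.0, 1.2, 1.0, 5.0, 1.1, 0.9, 1.0]
anomalies, mae = detect_anomalies(usage, window=3, scale=2.0)
print(anomalies)  # [6]: only the spike falls outside the bounds
```

Note that the spike also inflates the rolling mean for the next `window` points, which is the lagging behavior of MA; a larger `scale` tolerates this, while a smaller one would flag those points as well.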
Next, the ARIMA model is trained so that it can predict based on historical observations. For the energy monitoring system, this gives the capacity to better forecast electricity use, which can decrease costs by reducing electricity usage. Time series forecasting is a domain of machine learning (ML) with a dedicated group of approaches and procedures suited to predicting a value that depends on time. The prediction model is a combination of autoregression (AR) and MA: AR is calculated using lagged values of y, and MA is calculated using lagged errors, as presented in (1).
Here, $\hat{y}$ contains the values predicted using the AR and MA terms, n is the order of the autoregressive part of the model, and m is the order of the MA part of the model. Accordingly, the ARIMA function is specified by three order parameters: (n, i, m). Here, i is the integration order, i.e. the degree of differencing: differences between observations are used to make the time series stationary.
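Equation (1) is not reproduced in this text. Assuming the standard ARMA form applied to the series after i rounds of differencing, the prediction can be written as below; the constant $c$, the coefficients $\phi_j$ and $\theta_k$, and the error terms $\varepsilon_t$ are notation introduced here for illustration, while n and m are the AR and MA orders defined above.

```latex
\hat{y}_t = c + \sum_{j=1}^{n} \phi_j \, y_{t-j} + \sum_{k=1}^{m} \theta_k \, \varepsilon_{t-k}
```

The first sum is the AR part (lagged values of y) and the second is the MA part (lagged forecast errors).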

Detected anomaly data
The anomaly detection results show several data anomalies within one month. These anomalies were detected using the rolling mean trend (green line) and the calculated upper and lower bounds of usage (red lines), as shown in Figure 3 for SMA window sizes of 6 in Figure 3(a), 12 in Figure 3(b), 24 in Figure 3(c), 48 in Figure 3(d), and 96 in Figure 3(e), and in Figure 4 for EMA alpha values of 0 in Figure 4(a), 0.05 in Figure 4(b), 0.10 in Figure 4(c), 0.20 in Figure 4(d), 0.40 in Figure 4(e), and 0.80 in Figure 4(f). Red dots indicate detected anomalies: data plotted outside the red lines are considered anomalies. Here, detection proceeds with a 24-hour window.

A smaller window size is more sensitive to usage changes in the underlying process that generates the data, i.e. the predicted data. A larger window size, however, helps to reduce data noise. The task is therefore to select the window size that provides maximum predictive accuracy and minimum predictive error; the errors for the different window sizes are shown in Table 1.

A substantial downside of all MA outputs is that they are lagging indicators. As MA depends on past usage, it undergoes a time lag before it reflects a change in trend, and electricity usage may change quickly before an MA can exhibit the new trend. A shorter MA suffers less from this lag than a longer MA.

SMA was found to be the most straightforward calculation, as it is obtained by averaging usage over a chosen time period. With SMA as the rolling mean, anomalies in the time series were easily identified using the upper and lower bound lines. EMA gives higher weight to recent usage changes and therefore responds more rapidly to value changes than SMA. However, EMA did not prove effective for anomaly detection in time series forecasting (comparison in Tables 2 and 3).
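The effect of alpha on EMA responsiveness can be illustrated with a small sketch. It assumes the common recursive definition of EMA (each value is alpha times the new observation plus one minus alpha times the previous EMA, seeded with the first observation); the source does not state which EMA variant was used, and the function name and sample values are hypothetical.

```python
def exponential_moving_average(values, alpha):
    """Recursive EMA: ema[t] = alpha * x[t] + (1 - alpha) * ema[t - 1]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    ema = [values[0]]  # seed with the first observation
    for x in values[1:]:
        ema.append(alpha * x + (1.0 - alpha) * ema[-1])
    return ema


# Hypothetical usage that steps from 1.0 to 2.0 at index 3.
usage = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0]
slow = exponential_moving_average(usage, alpha=0.05)
fast = exponential_moving_average(usage, alpha=0.80)
frozen = exponential_moving_average(usage, alpha=0.0)

# A larger alpha tracks the step much sooner than a smaller one,
# and alpha = 0 never moves away from the first observation.
print(fast[-1] > slow[-1])  # True
```

Under this definition, alpha close to 1 follows the latest usage almost immediately, while alpha close to 0 behaves like a very long average, which mirrors the window-size trade-off discussed for SMA.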

Error measurement
Error measurement is a statistical calculation used to obtain the model error. Here, the mean absolute error (MAE) is calculated using (2). It is used to analyze the amount of error in the model for different MA window sizes.
MAE is used here to measure the degree of closeness between the estimated outcome (i.e., the predicted values, $\hat{y}$) and the exact values (i.e., the observed values, $y$), averaged over the number of observations. The calculated output for the different window sizes is presented in Table 1 and plotted in Figure 5.
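Equation (2) is not reproduced in this text. The standard definition of MAE, which matches the description above, is given below; the index $t$ and the symbol $N$ for the number of observations are notation assumed here, since the original symbols did not survive.

```latex
\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left| y_t - \hat{y}_t \right|
```

For example, observed values (1.0, 2.0, 3.0) and estimates (1.5, 2.0, 2.5) give an MAE of (0.5 + 0 + 0.5) / 3, which is approximately 0.33.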
The MAE shown in Figure 5 is the absolute difference between the estimated values and the exact values, where the rolling mean line is the estimate. The MAE for window size 12 was used to plot the upper and lower bounds because the value appears stable beyond 12, and a smaller window allows faster processing.

System evaluation
The proposed system was evaluated using the time series forecasting model ARIMA, and the ARIMA output was evaluated with different parameters. Forecasts are made with the forecast() function of the 'statsmodels.tsa.arima_model.ARIMA' (ARIMA) class. Initially, the dataset is split into a training set and a testing set. The training set is used to fit the model with the fit() function, which helps produce a predicted outcome for every element of the test set. Rolling forecasting is performed by re-creating the ARIMA model each time a new observation becomes available. Rolling forecasting is needed because the differencing and AR models depend on elements from earlier time steps. All elements are tracked in a variable named 'history', which is seeded with the training dataset and to which new elements are appended in each iteration.
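The rolling (walk-forward) procedure described above can be sketched independently of any forecasting library. In this illustration the forecaster is a deliberately simple placeholder, the mean of the last three values, and is not ARIMA; in the setup the paper describes, that callable would instead fit an ARIMA model on `history` and return its one-step forecast. The function names and the sample series are hypothetical.

```python
from statistics import mean


def split_train_test(series, train_percent=70):
    """Split a series chronologically, e.g. 70% training and 30% testing."""
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


def rolling_forecast(train, test, fit_and_forecast):
    """Predict each test element from all observations seen so far.

    `fit_and_forecast` receives the current history and returns a
    one-step-ahead prediction; the model is re-created at every step.
    """
    history = list(train)  # seeded with the training data
    predictions = []
    for observed in test:
        predictions.append(fit_and_forecast(history))
        history.append(observed)  # the true value becomes available
    return predictions


def last_three_mean(history):
    """Placeholder forecaster standing in for an ARIMA fit (assumption)."""
    return mean(history[-3:])


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
train, test = split_train_test(series)
predictions = rolling_forecast(train, test, last_three_mean)
print(predictions)  # [6.0, 7.0, 8.0]
```

Appending the observed value, not the prediction, to `history` is what makes this a rolling forecast: each step is a one-step-ahead prediction from real data.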

Figure 5. Mean absolute error
The time series forecasts obtained with ARIMA are presented in Figure 6 for the original data in Figure 6(a) and for window sizes of 6 in Figure 6(b), 12 in Figure 6(c), 24 in Figure 6(d), 48 in Figure 6(e), and 96 in Figure 6(f) (see appendix). Here, a solid blue line shows the expected outcome, and the dotted green line after the red horizontal line shows the rolling forecast predictions for comparison. It can be clearly seen that the forecasted elements follow trends close to the expected ones and are on the correct scale. The proposed model outperforms the common statistical method because anomalies are detected based on the current trend rather than on a fixed threshold value.

Comparison using mean squared error
Mean squared error (MSE), the average of the squared errors, is used to calculate the forecasting error. Errors of opposite signs do not cancel each other out in either MAE or MSE. Because of the squaring, MSE values are never negative, which makes MSE easier to use in an optimization technique. The ARIMA outputs for SMA and EMA are presented in Tables 2 and 3, respectively. In the tables, "Window" is the window size in hours; "Anomaly" and "Not Anomaly" are the numbers of usage records detected as anomalous and not anomalous; "MSE" is the error of the ARIMA model after removing the anomalies; "Predicted" is the forecast outcome after removing the anomalies; and "Expected" is the actual outcome in the time series after removing the anomalies. The first row of each table is the output of the forecasting model trained without removing anomalies. Hence, the expected outcome should be close to 0.935902 for the forecasting to be effective.
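A short sketch shows why squaring keeps errors of opposite sign from cancelling. The function names and sample values are hypothetical and are not taken from Tables 2 and 3.

```python
def mean_error(expected, predicted):
    """Signed mean error: positive and negative errors can cancel."""
    return sum(e - p for e, p in zip(expected, predicted)) / len(expected)


def mean_squared_error(expected, predicted):
    """MSE: average of squared errors, so every term is non-negative."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((e - p) ** 2 for e, p in zip(expected, predicted)) / len(expected)


expected = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]  # errors: -0.5, +0.5, -0.5, +0.5

print(mean_error(expected, predicted))          # 0.0, errors cancel out
print(mean_squared_error(expected, predicted))  # 0.25, errors do not cancel
```

A forecast that is consistently off by 0.5 in alternating directions looks perfect under the signed mean error, whereas MSE reports the error that is actually present.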

CONCLUSION
This paper presents an effective proposal and algorithms that help detect anomalies in electricity use, especially for smart home energy monitoring systems. This addresses the issue of inaccurate data in the data analysis stage. To achieve better anomaly detection for time series in smart HEMS, a practical model combining MA and a rolling mean is presented. The system also implements data cleaning, such as the removal of duplicate or unwanted data, which overcomes the issues in the data pre-processing stage. Next, ARIMA is implemented in the data analysis stage for forecasting the time series of smart home energy usage with anomalies detected and removed; hence, it was used to evaluate the proposed system. Before the evaluation with detection and removal, the ARIMA model was tested, and the forecasted elements were found to be close to the expected outcome. Finally, it was shown that detecting and removing anomalies using SMA provides better forecasting than leaving the anomalies in: SMA helped reduce the forecasting MSE from 0.179 to 0.066. Moreover, EMA is not adequate for detecting anomalies compared to SMA, as it reduces the expected outcome relative to the original one.

Figure 2. Time series plot of energy usage data

ISSN: 2088- 8708 Figure 3 .
Figure 3. SMA for different windows with upper bound/lower bound lines: (a) output for window size=6, (b) output for window size=12, (c) output for window size=24, (d) output for window size=48, and (e) output for window size=96

Figure 6. Time series forecasting using ARIMA: (a) original, (b) window:6, (c) window:12, (d) window:24, (e) window:48, and (f) window:96

Algorithm 1. Handle duplicate data (rows and columns)
In Algorithm 1, the first API, 'getDuplicateColumns', receives the dataset and iterates over it to find the names of the columns containing duplicate data. The second API, 'dropDuplicateColumns', also receives the dataset; it uses 'getDuplicateColumns' to find the duplicate columns and drops them.
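The body of Algorithm 1 is not reproduced in this text, so the following is only a sketch of how the two APIs it names might behave. It assumes the dataset is held as a plain dictionary mapping column names to lists of values (standing in for a data frame); the snake_case function names, the extra row helper, and the column names are hypothetical.

```python
def get_duplicate_columns(table):
    """Return names of columns whose values repeat an earlier column."""
    names = list(table)
    duplicates = []
    for i, name in enumerate(names):
        for earlier in names[:i]:
            # Compare only against columns that are themselves kept.
            if earlier not in duplicates and table[earlier] == table[name]:
                duplicates.append(name)
                break
    return duplicates


def drop_duplicate_columns(table):
    """Return a copy of the table without the duplicate columns."""
    duplicates = set(get_duplicate_columns(table))
    return {name: col for name, col in table.items() if name not in duplicates}


def drop_duplicate_rows(table):
    """Return a copy of the table keeping the first copy of each row."""
    names = list(table)
    kept = {name: [] for name in names}
    seen = set()
    for row in zip(*(table[name] for name in names)):
        if row not in seen:
            seen.add(row)
            for name, value in zip(names, row):
                kept[name].append(value)
    return kept


# Hypothetical columns: the second one repeats the first.
table = {
    "use_kw": [1.0, 1.0, 2.0],
    "house_overall_kw": [1.0, 1.0, 2.0],
    "fridge_kw": [0.1, 0.1, 0.3],
}
cleaned = drop_duplicate_rows(drop_duplicate_columns(table))
print(cleaned)  # {'use_kw': [1.0, 2.0], 'fridge_kw': [0.1, 0.3]}
```

For time series data, dropping duplicate rows is only safe when a timestamp column is part of each row; otherwise two genuinely equal readings taken at different times would be merged.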

Table 1. Mean absolute error with respect to window size in hours

Table 2. MSE with respect to window size in hours for SMA

Table 3. MSE with respect to alpha for EMA