A time-efficient and accurate retrieval of range-aggregate queries using the fuzzy clustering means (FCM) approach

ABSTRACT


INTRODUCTION
Big data analysis can reveal valuable insights, such as peer influence among customers, disclosed by analysing shoppers' transactions together with social and geographical data, but the cost of processing data in a big data environment is very high. Quickly acquiring the result of a range-aggregate query in a big data environment is a challenging task. A range query is a common database operation that retrieves all records from a database whose value lies between a lower and an upper boundary, for example, listing all employees with 3 to 5 years of experience. Range queries are unusual in that the number of rows to be returned is not known in advance; a query may return any number of rows, or none at all. Many other queries, such as the ten most senior employees or the newest employee, can be executed more efficiently because they have an upper bound on the number of results they return. Range query retrieval is an efficient process on a small database, but on a large database holding a big volume of records it consumes much more time. This problem is addressed in the present methodology in terms of retrieving more accurate and more time-efficient output for range queries from large databases.
In a big data environment, data sets are mostly stored in a distributed environment because the volume of data is very large. The cost of distributed range-aggregate queries mainly depends on two factors: network communication cost and local file scanning cost. The first depends on data transmission and synchronization for aggregate operations when the selected files reside on different servers. The second depends on scanning local files to search for the selected tuples. As the data set size increases continuously, both of these costs increase dramatically. Various approaches to minimize these two costs have been studied in the past [1][2][3][4], but today analysts are increasingly focusing on efficient methods and tools for big data analysis. Sampling and histogram approaches have already been used in big data environments to support approximate answering or selectivity estimation; however, they could not obtain acceptable approximations for the underlying data sets.

EXISTING SYSTEM
The FastRAQ method is proposed for big data environments in [1]. It first divides large data into small independent partitions using a balanced partitioning algorithm, and then it generates a local estimation sketch for each partition. When a range-aggregate query request arrives, FastRAQ obtains a result from each partition and produces the final result by summarizing the local estimates from all partitions. FastRAQ is implemented on the Linux platform, and its performance is evaluated using nearly 10 billion data records. The algorithms used in [1] are a stratification algorithm and the k-means algorithm. The disadvantages of [1] are inefficient retrieval of results, a stratification problem, and high time complexity.
The fast algorithms in [3,5] are introduced to speed up range-sum and range-max queries in OLAP systems. Their main idea is to pre-compute the multi-dimensional prefix sums of the data cube. The total storage requirement is kept the same as that of the data cube, with a small increase in time for queries of a singleton cell, since any cell of the data cube is computed with the same time complexity as a range-sum query. Aggregation operations such as max and sum are used to answer unexpected run-time queries, and a hierarchical tree structure makes this efficient for a wide range of queries. The disadvantage is that in a few cases it suffers from lower accuracy.
The Prefix-sum Cube (PC) method [4,6,7] was first used in OLAP to improve range-aggregate query performance. All range-aggregate queries are processed in constant time, and all numerical attribute values are kept in sorted order; however, when a new row or tuple is added to the cube, the prefix sums must be recalculated for all dimensions. Hence, the update cost is even worse.
Online Aggregation (OLA) is an important approach for speeding up range-aggregate queries and is widely used in relational databases [8,9] and cloud systems [10][11][12]. In OLA systems, background computing processes run for a long time; the returned results are progressively refined and their accuracy improves in subsequent stages. In the early stages, however, users cannot get an appropriate result with satisfactory accuracy. OLA is also expensive, wastes computing resources, and reduces performance [13][14][15].
The HyperLogLog algorithm referred to in [16] proves to be easy to code and efficient, and is even near-optimal under certain criteria. It is highly practical and versatile, and it conforms well to what the analysis predicts. The algorithm used in [16] is "HyperLogLog with a near-optimal cardinality algorithm", but it has the disadvantages of lower accuracy and slow execution time.
A new algorithm used in [17] represents a series of improvements that reduce the memory requirements and significantly increase the accuracy for an important range of cardinalities. In a single pass, it computes large cardinalities and uses memory more efficiently. The algorithm used in [17] is the "HyperLogLog algorithm". The disadvantage is that it still produces erroneous values in a few cases.

OVERVIEW OF THE FCM APPROACH

Problem statement
The existing systems still have issues with inefficient retrieval of range-aggregate queries: the algorithm clusters larger datasets inaccurately, and the existing system incurs high time complexity. Thus the overall system performance is degraded.

Key idea
The FCM approach works in the following manner. First, the big data is divided into small independent partitions using a balanced partitioning algorithm, as in [1]. A local estimation sketch is then created for each independent partition. When a range-aggregate query request arrives, the FCM approach obtains the result by summarizing the local estimates obtained from all partitions. The balanced partitioning algorithm works together with a post-stratified sampling model: it partitions all the data in the database into different groups according to their attribute values, and it further separates each group into multiple partitions with regard to the current data distribution and the number of available servers. The estimation sketch is a new type of multi-dimensional histogram that is built from the learned data distributions. This multi-dimensional histogram can measure the quality of rows or data set distributions more accurately; it maintains nearly equal frequencies for different values within each bucket, even when the frequency distributions vary significantly. This design reduces time complexity by splitting the overall work across multiple partitions. The partitioning is done using the attribute values that connect multiple databases, which leads to a faster query response for every partition; the partial responses can be combined later. When a range-aggregate query request arrives, it is delivered to each partition. First, the cardinality estimator for the queried range is built from the histogram in each partition. Then the attribute value in each partition is estimated as the product of the sample and the cardinality value estimated by the estimator. The final result for the query request is the sum of all local estimates from all partitions, as sketched below.

Post stratification sampling
Partitioning is the process of dividing a large table into many smaller tables based on the value of a particular field in a table record. Data partitioning is an important step in data analysis because it improves query performance: it reduces the amount of data that must be scanned to produce the query result, and hence performance increases.
In the proposed system, post-stratification is introduced to overcome this limitation; it selects the stratification variable used for efficient data partitioning. Stratification is sometimes introduced after the sampling phase, in a process called "poststratification". This approach is used when the experimenter does not have enough information to create a stratifying variable during the sampling phase. Although the method is susceptible to the pitfalls of post hoc approaches, it can provide several benefits in the right situation. Implementation usually follows a simple random sample. Poststratification can also be used to implement weighting, which can improve the precision of a sample's estimates.
Poststratification is a method for adjusting the sampling weights, usually to account for underrepresented groups in the population. It involves adjusting the sampling weights so that they sum to the population sizes within each poststratum. This usually decreases the bias due to nonresponse and underrepresented groups in the population, and it also tends to produce smaller variance estimates. The svyset command has options for applying poststratification adjustments to the sampling weights: the poststrata() option takes a variable that contains poststratum identifiers, and the postweight() option takes a variable that contains the poststratum population sizes. If $w_j$ is the unadjusted sampling weight for the $j$th sampled individual, the poststratification-adjusted sampling weight is

$$w_j^* = \frac{M_{k_j}}{\hat{M}_{k_j}}\, w_j$$

where $M_k$ is the population size for poststratum $k$, $k_j$ is the poststratum of the $j$th individual, and $\hat{M}_k = \sum_{j \in S_k} w_j$ is the sum of the unadjusted weights over the set $S_k$ of sampled individuals in poststratum $k$. The point estimates are computed using these adjusted weights. For replication-based variance estimation, the BRR and jackknife replicate-weight variables are similarly adjusted to produce the replicate values used in the respective variance formulas. The score variable for the linearized variance estimator of a poststratified total is

$$z_j(\hat{Y}^P) = \frac{M_{k_j}}{\hat{M}_{k_j}} \left( y_j - \frac{\hat{Y}_{k_j}}{\hat{M}_{k_j}} \right)$$

where $\hat{Y}_k$ is the total estimator for the $k$th poststratum, $\hat{Y}_k = \sum_{j \in S_k} w_j y_j$. For the poststratified ratio estimator, the score variable is

$$z_j(\hat{R}^P) = \frac{z_j(\hat{Y}^P) - \hat{R}^P z_j(\hat{X}^P)}{\hat{X}^P}$$

where $\hat{X}^P$ is the poststratified total estimator for item $x_j$.

Fuzzy clustering means (FCM) approach
In our proposed system, FCM clustering is introduced to group similar index values by calculating their membership degrees. With FCM clustering, similarity-based grouping can be done accurately, since the exact similarity of classes can be identified. Clustering of numerical data forms the basis of many classification and system-modelling algorithms. The purpose of clustering is to recognize the natural groupings of data in a large data set in order to create a concise representation of the system's behaviour. Fuzzy Clustering Means (FCM) is a clustering method that allows one piece of data to belong to two or more clusters. It is based on minimization of the following objective function:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \lVert x_i - c_j \rVert^2$$

where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in cluster $j$, $x_i$ is the $i$th item of $d$-dimensional measured data, $c_j$ is the $d$-dimensional centre of the cluster, and $\lVert \cdot \rVert$ is any norm expressing the similarity between measured data and the centre. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership $u_{ij}$ and the cluster centres $c_j$ updated by

$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{\frac{2}{m-1}}}, \qquad c_j = \frac{\sum_{i=1}^{N} u_{ij}^m \, x_i}{\sum_{i=1}^{N} u_{ij}^m}.$$

The iteration stops when $\max_{ij} \lvert u_{ij}^{(k+1)} - u_{ij}^{(k)} \rvert < \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ is the iteration step. This procedure converges to a local minimum or a saddle point of $J_m$.

Range cardinality queries
The existing papers [1,18,19] use a unique RecordId to determine whether cardinalities obtained from different buckets belong to the same record. We adopt the algorithm explained in [1,20,21] to estimate the cardinality in the queried range; a simplified sketch of the idea follows.

RESULT AND ANALYSIS
Approximately 200,000 input data sets [22][23][24][25][26][27] are used to evaluate the performance of the FCM approach. The same data sets are used for the existing system, and its performance measures are calculated as well. After applying both the existing and proposed systems to the same input data sets, the performance of the proposed FCM approach is better than that of the existing FastRAQ approach. The sample input data sets used and the resulting graphs are described below. Several performance measures are analysed in order to show that the proposed system is better than the existing system: i) time to execute the query, ii) accuracy, and iii) error rate.

Accuracy performance
The accuracy of the FCM approach is higher than that of all other existing approaches, as shown in Figure 1.

Execution time
The time taken to execute a user-entered range-aggregate query is shown in Figure 2; the FCM approach executes the query in less time than the existing approach.

Error rate
The error-rate graph for the user query is shown in Figure 3. The error rate of the FCM approach is nearly 50% lower than that of the existing approach.

CONCLUSIONS AND FUTURE ENHANCEMENT
In this paper, a new approach, the Fuzzy Clustering Means (FCM) approach, is proposed that quickly acquires accurate estimations for range-aggregate queries in big data environments. For ad-hoc range-aggregate queries, FCM has a time complexity of O(nc²p), where n is the number of instances in the given dataset, c is the number of clusters, and p is the number of data points. This time complexity is better than that of the previous existing algorithms. For future work, it is planned to investigate how this solution can be extended to the m:n format problem. It is also planned to analyse how FCM can be used as a tool to boost the performance of data analysis in databases.