Granularity analysis of classification and estimation for complex datasets with MOA

ABSTRACT


INTRODUCTION
Complex datasets present both opportunities and challenges for data analytics. The complexity of a dataset indicates the difficulty a data scientist experiences when curating insights: a complex dataset is usually harder to classify than a regular one and generally requires a diverse set of technical approaches [1]. Complex datasets require more effort to outline the data prior to visualization and curation. Characterizing the complexity of a dataset is essential, and its future complexity must also be taken into account. Big data is one form of complex dataset, since a massive amount of data can slow even high-speed computers to a bottleneck when calculating and extracting insights [2], [3]. Other complications derive from the data sources. Different sources can generate disorganized datasets or datasets with dissimilar structures. Such data must be preprocessed in order to comply with the format of the primary repository.
To relieve the bottleneck in processing complex datasets, data transformation and refining steps (pre-processing) help reduce processing power and time. In addition, a data mining approach based upon the integration of knowledge is introduced. The pre-processing steps for business-oriented data are chosen to form an ontology ambitious information system (OAIS). The knowledge base is then determined to help sort out the post-processing of interpretation. Finally, the integration of objective and subjective criteria in teaching is evaluated to develop expert knowledge.
Pre-processing of datasets incorporates normalization, attribute extraction, noise removal, classification and structure re-configuration. Nawi et al. [4] presented an artificial neural network based algorithm for data pre-processing. The algorithm has become common and serves as an analytical tool for mining, pattern recognition and machine learning. Big data has been mined using a parallelism approach introduced in [5]. However, that approach does not describe how to discard redundant and messy data, which is an important preprocessing step. The relation between preprocessing and complex datasets under different technological approaches has been studied experimentally in [6]. Analytical frameworks such as Flink, Spark and MapReduce have also been proposed for learning from complex data. Insights from big data curation and the analytics infrastructure at Twitter have been presented in [7]. A dynamic role in assisting data scientists with big data is emphasized, but comprehensive insights are not available. Data analytics from several algorithms must be aggregated into a production system, and the authors succeed in sharing their results from Twitter with the academic community.
This research investigates the performance of several pre-processing models in specifying the granularity level, reducing noisy samples and correcting possible errors in the training samples. The main objectives are to confirm the accuracy of classification, to simplify the computation and to improve preprocessing. Bayesian, Boosting, Nearest Neighbor and the proposed classification models are introduced. Additionally, the complex datasets are subsequently executed in a post-processing environment. To accelerate the post-processing calculation, the parallel processing system presented in [8] is employed. The MOA simulation [9] results and speed-up performance are summarized. The simulations use complex datasets obtained from a public repository. The remainder of the paper is organized as follows: Sections 2 and 3 present the theoretical context of complex dataset characteristics and the pre-processing approaches, respectively. Section 4 presents the parallel estimation model. Finally, results and analysis are given in Section 5.

COMPLEX DATASETS
There is an ongoing debate about what "big data" means. At its core, the debate concerns complexity. Difficulty in handling data is often a matter of size: enormous effort is required to make use of large data, even just to determine where to begin manipulating it. Complexity thus implies a tedious task. Yet even a small dataset can exhibit complexity that makes it hard for data scientists to mine with current techniques.
Data from various senders, or different datasets from the same sender, are structured dissimilarly. For instance, one unit keeps a few different files, while another unit stores its information in a database. Furthermore, some database instances contain content that duplicates the content of the files. Making use of data from multiple sources without duplicating or losing information necessitates a preprocessing task [10].
By one definition of "big data," the size of the collected data can overwhelm both the processing units and the applications used for analysis. Size can be in petabytes (PB): the larger the dataset, the harder it is to fit in main memory during processing. Let A denote a given dataset matrix with a rows and b columns, where row i is [Ai1, Ai2, Ai3,…, Ai(b-1), Aib] for each i = 1, 2, 3,…, a. The matrix A is presumed to be a deterministic set. The state space of the dataset is then [a, b], and the computational cost is O(ab) [11].
The level of granularity is vital for the development of a full report or dashboard and for data integration or visualization. It is simpler for a developer to drill down into the finest detail of a dataset; nevertheless, there is a trade-off between data indexing and the computational cost of analytical depth. Data curation that favors granular drill-down must handle a larger amount of ad hoc data, because data integration, summarization and pre-processing are skipped.
Different databases use different query languages. Structured Query Language (SQL) is the principal means of querying data from a central relational database, but if third-party hardware is used, its syntax and API have to be interfaced, and the communication protocols and internal database structure must also be understood in order to gain access. An analytical tool must be flexible enough to support a built-in connection to the target database through an API; otherwise a bulky process of extracting the data into an SQL database or warehouse is needed [12].
Processing multimedia data stored in tabular form (.csv) is already a burden, and massive unstructured data is another tedious task, since it combines rich text with video and audio streams. Different types of data follow different rules, and reconciling them into a single source of truth is critical for decision making [13].
Disseminated data occurs whenever data is stored in several places, for instance at the workplace, in clouds, or in different branches. Such data is isolated, and collecting all of it is not easy. Moreover, after collection, some standardization, normalization and cleansing are compulsory before the different datasets can be cross-referenced and manipulated. Location-based datasets are gathered according to the related objectives and applications [14].
Lastly, not only the current data is taken into account but also the speed at which data will arrive (growth rate). It may be changing or rising. If the datasets are often being updated meaning that additional  [15]. In practice, when complexity occurs in data, analytical tools must be developed, relying on (a) clustering analysis or (b) classification methods. Even if such a tool resolves most data analysis problems, datasets such as those illustrated in Figure 1 still arise. Such a dataset cannot be approximated by a straight line, nor easily segmented into clusters. It is complex per se, as it exhibits a spherical, recurring or loopy structure. Figure 1 shows examples of complex data whose characteristics traditional techniques cannot fully classify.

PREPROCESSING METHODS
This section describes the preprocessing approaches. The classification algorithms, our proposed method, which is applicable to complex data, and a comprehensive discussion are given.

Bayesian classification
One of the classical predictors is the Bayesian classifier, which rests on the simple hypothesis that all input attributes are independent [16]. This classifier is known for its low computational cost and its simplicity. Let there be m different classes (C1, C2, C3,…, Cm); the trained Bayesian classifier predicts the class Ci to which an input X most probably belongs. The classification model works as follows. Let each tuple be an n-dimensional attribute vector X = (x1, x2, x3,…, xn) of n finite attributes. X is assigned to class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m and j ≠ i. The Bayesian classifier calculates the probability of Ci as P(Ci|X) = P(X|Ci) P(Ci) / P(X). The values P(Ci) and P(X|Ci) are estimated from the training dataset (a table of tuples); P(X) is the same for every class, so it does not affect the comparison. The algorithm simply accumulates counts each time it receives a new batch of examples. The Bayesian classification algorithm is shown in Figure 2.
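As an illustration of the count-accumulating behaviour described above, the following is a minimal sketch of an incremental naive Bayes classifier for categorical attributes. It is not the algorithm of Figure 2 or MOA's implementation; the class name, the method names and the Laplace smoothing term are assumptions introduced for the example.

```python
from collections import defaultdict
import math


class IncrementalNaiveBayes:
    """Naive Bayes over categorical attributes, updated batch by batch."""

    def __init__(self, smoothing=1.0):
        # Laplace smoothing is an assumption of this sketch, not taken from the text.
        self.smoothing = smoothing
        self.class_counts = defaultdict(int)   # number of examples seen per class
        self.value_counts = defaultdict(int)   # count of (attribute index, value, class)
        self.values_seen = defaultdict(set)    # distinct values seen per attribute

    def update(self, batch):
        """Accumulate counts from a batch of (attribute_vector, label) pairs."""
        for x, label in batch:
            self.class_counts[label] += 1
            for k, v in enumerate(x):
                self.value_counts[(k, v, label)] += 1
                self.values_seen[k].add(v)

    def log_posterior(self, x, label):
        """Return log P(Ci) + sum_k log P(x_k | Ci); P(X) is omitted because
        it is identical for every class."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for k, v in enumerate(x):
            num = self.value_counts[(k, v, label)] + self.smoothing
            den = self.class_counts[label] + self.smoothing * len(self.values_seen[k])
            score += math.log(num / den)
        return score

    def predict(self, x):
        """Pick the class with the largest (unnormalized) posterior."""
        return max(self.class_counts, key=lambda c: self.log_posterior(x, c))


# Usage with two hypothetical batches of weather-style tuples.
model = IncrementalNaiveBayes()
model.update([(("sunny", "hot"), "no"), (("rainy", "cool"), "yes")])
model.update([(("sunny", "mild"), "no"), (("rainy", "mild"), "yes")])
print(model.predict(("sunny", "hot")))
```

Because only counts are stored, a new batch can be absorbed without revisiting earlier examples, which is what keeps the computational cost low.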

Boosting classification
Boosting denotes an algorithm that converts weak learners into strong learners. The weighting parameter divides the weight over the matrix A into two equal parts. The first half of the weight is allocated to the correctly classified part, and the second half is assigned to the misclassified part. A Poisson distribution is employed to compute the random probability used to train the model. The key concept of boosting is to train a sequence of weak learners. The weighting parameter is applied to the examples that were wrongly classified in the previous iteration. Only this time being the weighting parameter alters regarding to

Nearest neighboring classification
The k-nearest neighbors (k-NN) classifier differs from the algorithms described above in several respects. It is non-parametric: it requires no hypotheses about the probability density function of the inputs. When the input distribution is unknown, k-NN is more suitable than parametric algorithms. However, parametric algorithms tend to produce fewer errors when the input probability function is taken into account. k-NN is a lazy machine learning algorithm, which analyzes the data during the testing phase rather than during training. An advantage of lazy k-NN is that it adapts rapidly to changes, because it does not build a model from the whole dataset in advance. A major disadvantage is the large computational cost incurred during testing. In k-NN classification, an input is assigned the class held by the majority of its k nearest neighbors. The algorithm is presented in [18].
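The majority-vote rule can be sketched in a few lines. This is a minimal illustration rather than the algorithm of [18]; the function name, the Euclidean distance and the toy points are assumptions made for the example.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (vector, label) pairs. All of the work happens here,
    at prediction time, which is why k-NN is called a lazy learner.
    """
    # Sort the stored examples by Euclidean distance to the query (assumed metric).
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Usage with two hypothetical, well-separated groups of points.
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
print(knn_predict(train, (0.2, 0.1), k=3))
```

Each prediction scans every stored example, so the cost per query grows with the size of the training set, which matches the testing-time cost noted above.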

Proposed classification
The proposed method is a logistic regression based learner that combines classifiers in order to maximize the probability of the observed values. At the base level of calculation, diverse learning algorithms are trained individually on a complete training set. This is unlike algorithms that choose values minimizing the sum of squared errors. The proposed method combines the preprocessing techniques and post-processes their outputs at a deep learning level. Note that the original learners are not customized; the proposed mechanism aims at obtaining higher classification accuracy and higher performance on complex datasets. The proposed model is trained on the meta-outputs from the base level of calculation. The algorithm is depicted in Figure 4.
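The general idea of training a logistic regression model on the outputs of unmodified base learners can be sketched as follows. This is only an illustration under stated assumptions and not the algorithm of Figure 4: the two base learners are hypothetical fixed threshold rules standing in for trained classifiers, the task is binary, and the learning rate and number of epochs are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def meta_features(base_learners, x):
    """Collect the base-level outputs for one input (the meta-outputs)."""
    return [learner(x) for learner in base_learners]


def train_meta_learner(meta_rows, labels, lr=0.5, epochs=500):
    """Fit logistic regression by stochastic gradient ascent on the
    log-likelihood; weights[0] is the bias term."""
    weights = [0.0] * (len(meta_rows[0]) + 1)
    for _ in range(epochs):
        for row, y in zip(meta_rows, labels):
            inputs = [1.0] + list(row)
            p = sigmoid(sum(w * v for w, v in zip(weights, inputs)))
            # Gradient of the log-likelihood for one example is (y - p) * input.
            weights = [w + lr * (y - p) * v for w, v in zip(weights, inputs)]
    return weights


def stacked_predict(base_learners, weights, x):
    """Run the untouched base learners, then apply the meta-level model."""
    inputs = [1.0] + meta_features(base_learners, x)
    p = sigmoid(sum(w * v for w, v in zip(weights, inputs)))
    return 1 if p >= 0.5 else 0


# Hypothetical base learners: each looks at a single attribute only.
def learner_a(x):
    return 1.0 if x[0] > 0.5 else 0.0


def learner_b(x):
    return 1.0 if x[1] > 0.5 else 0.0


base = [learner_a, learner_b]
train_x = [(0.1, 0.2), (0.9, 0.1), (0.2, 0.8), (0.9, 0.9)]
train_y = [0, 0, 0, 1]  # positive only when both attributes are high
weights = train_meta_learner([meta_features(base, x) for x in train_x], train_y)
print(stacked_predict(base, weights, (0.8, 0.7)))
```

Neither base rule can express the target on its own, but the meta-level model learns how to weigh their outputs, and the base learners themselves are left unchanged.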

Granularity and performance
In a preprocessing approach, the number of classes observed for the process determines the distribution of the dataset. As far as performance is concerned, it implies how the original dataset is dispersed among the classifiers. Granularity measures the level of hierarchy (in a decision tree), the relative size, the level of detail, the depth of penetration and the scale of a dataset. Accordingly, the performance of any classifier differs with the number of selected classes. One reason is that learning algorithms become less capable when data are scarce. However, higher granularity yields a better-structured model, owing to the more detailed state space.
In this research, the following points are addressed. Firstly, the dependency on the granularity level in complex datasets is investigated, and the classifiers for an experimental learner with complex datasets are chosen. Secondly, the training results show the benefit of a higher granularity for all datasets. Lastly, the model that is robust in terms of data granularity is further analyzed with high processing power in order to examine speed-up performance and efficiency.
The following metrics are used to evaluate the performance of the proposed technique. Accuracy is the number of correct classifications divided by the total number of instances. The processing time consumed by each individual classifier is measured for the efficiency comparison. Speed-up reflects the performance of a parallel processing system in comparison with a slower, sequential version; it is computed as the sequential time divided by the parallel time.
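The metrics above reduce to simple ratios, sketched below. The function names and the sample numbers are hypothetical and are not results from this study; the efficiency formula (speed-up divided by the number of processors) is the conventional definition and is an assumption here, since the text does not state one.

```python
def accuracy(predicted, actual):
    """Fraction of instances whose predicted class matches the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def speedup(sequential_time, parallel_time):
    """Sequential runtime divided by parallel runtime (same time units)."""
    return sequential_time / parallel_time


def efficiency(sequential_time, parallel_time, processors):
    """Speed-up per processor; 1.0 would mean ideal linear scaling (assumed definition)."""
    return speedup(sequential_time, parallel_time) / processors


# Hypothetical numbers for illustration only.
print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
print(speedup(8.0, 2.5))
print(efficiency(8.0, 2.5, processors=4))
```

For the parallel time to be comparable with the sequential time, it should include any overhead such as splitting and re-assembling the data.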

ESTIMATION METHOD
The open-source simulation tool MOA is employed for the analytics. Four complex datasets have been selected, and the granularity analysis of the preprocessing methods has been compiled. The experiments were run on a Fujitsu machine running Windows 8, with an Intel® Core™ i5 CPU at 2.67 GHz and 8 GB of RAM. The datasets were selected so that they differ in number of attributes, number of instances, level of detail and size. Datasets 1, 2, 3 and 4 are run on a single server (M/M/1), and each dataset is also divided into 4 subtasks that are processed independently on four parallel processors (M/M/4). The parallel processing time includes the splitting time and the re-assembling time. Splitting is based upon software developed in [19], and the simulation model is shown in Figure 5. Performance evaluation of parallel processing for reducing problem complexity and time is also presented in [20]. The simulation results for one and four processing units are given in Table 10.
Table 1.
The granularity and completeness of the four datasets are shown in Tables 2-5. It can be seen that datasets 2-4 are complete, while only dataset 1 contains a high percentage of missing values and is considered incomplete.
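The split, process and re-assemble pattern used for the four-processor runs can be sketched as follows. This is not the splitting software of [19] or the simulation model of Figure 5; the function names and the per-row work are placeholders. Threads are used only to keep the sketch portable, and CPU-bound work in Python would normally need processes to obtain a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def split(rows, parts=4):
    """Cut the dataset into at most `parts` contiguous chunks of near-equal size."""
    size = max(1, -(-len(rows) // parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def process(chunk):
    """Placeholder for the per-subtask work; here it just sums each row."""
    return [sum(row) for row in chunk]


def run_parallel(rows, workers=4):
    """Split, process the chunks concurrently, then re-assemble in order."""
    chunks = split(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(process, chunks))  # map preserves chunk order
    return [value for part in partial_results for value in part]


# Usage on a small hypothetical dataset of 10 rows.
rows = [[i, i + 1] for i in range(10)]
print(run_parallel(rows))
```

Timing `run_parallel` as a whole would capture the splitting and re-assembling overhead together with the processing itself.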
The performance of the preprocessing methods described in Section 3 is reported with metrics such as Area Under the Receiver Operating Characteristic curve (AUROC), Classification Accuracy (CA) and precision. The preprocessing performance evaluations for each dataset are shown in Tables 6-9. In all cases the proposed method marginally outperforms the others. The preprocessing time of the proposed method, in msec, is then used to compute the parallel processing (post-processing) in the simulation model shown in Figure 5. For comparison with other research, the Naïve Bayes (NB) pre-processing mechanism in Spark is considered. Note  Table 10. The speed-up metric for the four datasets is calculated from the simulation results and displayed in Table 11. For datasets #3 and #4, the preprocessing time improves the speed-up, as it differs significantly from the post-processing time.

CONCLUSION
In a parallel processing system, several processing units are connected in parallel with each other, and this combined structure is fed with a complex dataset. Because datasets are complex, preprocessing techniques are compulsory. The proposed preprocessing algorithm is introduced and outperforms other existing methods in both CA and precision. The proposed classification method also improves the granularity level of complex datasets. Finally, parallel processing is employed to measure the post-processing time and speed-up metrics. Dataset complexity and pre-processing time reflect the effectiveness of each algorithm. Speed-up is based on the runtime of the MOA simulation. Future research will consider approximation techniques to lessen the processing time complexity incurred by simulation. A subsequent publication will address optimizing both CA and precision in preprocessing.