Frequent itemset mining: technique to improve Eclat-based algorithm

ABSTRACT

In candidate generation, each k-itemset candidate is generated from two frequent (k-1)-itemsets and its support is counted. If the support is lower than the threshold, the candidate is discarded; otherwise it is a frequent itemset and is used to generate (k+1)-itemsets [4]. The first scan of the database builds the set of transaction ids (tidset) of each single item. Starting from single items (k = 1), the frequent (k+1)-itemsets are grown from the previous k-itemsets in a depth-first computation order. The computation intersects the tidsets of the frequent k-itemsets to obtain the tidsets of the corresponding (k+1)-itemsets. The process is repeated until no more frequent candidate itemsets can be found [4].
Eclat starts with the empty prefix {}, for which the search tree is the initial search tree. To divide the initial search tree, it picks the prefix {a}, generates the corresponding equivalence class and mines frequent itemsets in the subtree of all itemsets containing {a}. This subtree is divided further into two subtrees by picking the prefix {ab}: the first consists of all itemsets containing {ab}, the other of all itemsets containing {a} but not {b}. The process recurses until all itemsets in the initial search tree have been visited. The search tree of the item base {a, b, c, d, e} is shown in Figure 1. Figure 2 illustrates how data in a horizontal layout is transformed into a vertical layout, in which each item is associated with its set of transaction ids (tidset) [4].
This paper starts with a formal introduction of Eclat-based algorithms. Section 2 presents related works. Section 3 gives a comparative analysis of the advantages and disadvantages of each algorithm. Section 4 discusses the enhancement of the Incremental Eclat algorithm. Section 5 reports experimental results comparing previous works with the proposed enhancement, and Section 6 concludes the discussion with future works.
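As a minimal sketch of the first scan and of a single tidset intersection, the Python fragment below may help. The four toy transactions and the helper name to_vertical are assumptions made for illustration only; they are not taken from the cited works.

```python
from collections import defaultdict

def to_vertical(transactions):
    """One scan of the database: map each item to its tidset."""
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions, start=1):
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)

# Toy horizontal database (assumed data); transaction ids start at 1.
transactions = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c", "e"},
]

tidsets = to_vertical(transactions)

# The tidset of the 2-itemset {a, c} is the intersection of the two
# 1-itemset tidsets, and its support is the size of that intersection.
tids_ac = tidsets["a"] & tidsets["c"]
support_ac = len(tids_ac)
print(sorted(tids_ac), support_ac)  # [1, 2] 2
```

With a minimum support of 2, {a, c} would be kept and extended further, while an itemset such as {a, b}, whose tidsets share only transaction 1, would be discarded.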

RELATED WORKS
Finding frequent itemsets is a time-consuming process. Many versions of frequent itemset mining algorithms have been proposed that aim at reducing time and space complexity [11].

Traditional Eclat (tidset) algorithm
The Tidset algorithm uses a vertical dataset and a bottom-up approach for searching items in a database [2]. Figure 3 depicts the pseudocode for the Tidset algorithm.
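The bottom-up, depth-first search over tidsets can be sketched in Python as follows. This is an illustrative sketch and not a transcription of the pseudocode in Figure 3; the function names eclat and mine_tidset and the toy vertical database are assumptions.

```python
def eclat(prefix, items, min_support, out):
    """Depth-first search over one equivalence class.

    items is a list of (item, tidset) pairs, each a frequent
    extension of prefix.
    """
    for i, (item, tids) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Build the equivalence class of the new prefix by intersecting
        # its tidset with the tidsets of the remaining class members.
        suffix = []
        for other, other_tids in items[i + 1:]:
            common = tids & other_tids
            if len(common) >= min_support:
                suffix.append((other, common))
        if suffix:
            eclat(itemset, suffix, min_support, out)
    return out

def mine_tidset(vertical, min_support):
    """Start from the empty prefix with the frequent single items."""
    frequent = sorted(
        ((item, tids) for item, tids in vertical.items()
         if len(tids) >= min_support),
        key=lambda pair: pair[0],
    )
    return eclat((), frequent, min_support, {})

# Toy vertical database (assumed data): item -> tidset.
vertical = {"a": {1, 2, 3}, "b": {1, 4}, "c": {1, 2, 4}, "d": {3}, "e": {4}}
frequent = mine_tidset(vertical, min_support=2)
print(frequent)
# {('a',): 3, ('a', 'c'): 2, ('b',): 2, ('b', 'c'): 2, ('c',): 3}
```

Every intermediate tidset is kept for the lifetime of its recursive call, which is where the memory pressure of the tidset representation comes from on long tidsets.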

dEclat algorithm
dEclat performs a depth-first search of the subset tree. Zaki and Gouda (2003) show that diffsets allow it to mine at much lower supports than the base Eclat method. The input to the procedure is a set of class members for a subtree rooted at P. Frequent itemsets are generated by computing diffsets for all distinct pairs of itemsets and checking the support of the resulting itemsets. A recursive procedure call is made with those itemsets found to be frequent at the current level. This process is repeated until all frequent itemsets have been enumerated [12]. Figure 4 depicts the pseudocode for the dEclat algorithm.
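A minimal Python sketch of the diffset recursion is given below, assuming the standard diffset relations d(PXY) = d(PY) - d(PX) and support(PXY) = support(PX) - |d(PXY)|. It is an illustration rather than the pseudocode of Figure 4; the function names declat and mine_diffset and the toy data are hypothetical.

```python
def declat(prefix, items, min_support, out):
    """Depth-first search using diffsets instead of tidsets.

    items is a list of (item, diffset, support) triples for the
    class members that extend prefix.
    """
    for i, (item, diff, sup) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = sup
        suffix = []
        for other, other_diff, _ in items[i + 1:]:
            new_diff = other_diff - diff    # d(PXY) = d(PY) - d(PX)
            new_sup = sup - len(new_diff)   # sup(PXY) = sup(PX) - |d(PXY)|
            if new_sup >= min_support:
                suffix.append((other, new_diff, new_sup))
        if suffix:
            declat(itemset, suffix, min_support, out)
    return out

def mine_diffset(vertical, min_support):
    """Seed the search with diffsets relative to the empty prefix."""
    all_tids = set().union(*vertical.values())
    items = [
        (item, all_tids - tids, len(tids))
        for item, tids in sorted(vertical.items(), key=lambda kv: kv[0])
        if len(tids) >= min_support
    ]
    return declat((), items, min_support, {})

# Toy vertical database (assumed data): item -> tidset.
vertical = {"a": {1, 2, 3}, "b": {1, 4}, "c": {1, 2, 4}, "d": {3}, "e": {4}}
frequent = mine_diffset(vertical, min_support=2)
print(frequent)
# {('a',): 3, ('a', 'c'): 2, ('b',): 2, ('b', 'c'): 2, ('c',): 3}
```

The supports match those obtained from tidset intersections on the same data, while only the differences between tidsets are stored at each level.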

SortDiffset algorithm
SortDiffset was created by Trieu and Kunieda in 2014. In general, the diffsets in an equivalence class should be sorted in descending order of size, so that the new itemsets generated from them are represented by diffsets of smaller size [13]. Figure 5 depicts the pseudocode for the SortDiffset algorithm.
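The effect of the ordering can be illustrated on a single toy equivalence class. The sketch below is not the pseudocode of Figure 5: the helper pairwise_diffsets and the class members are assumptions, and support pruning is omitted to keep the example short.

```python
def pairwise_diffsets(members):
    """Diffsets of all pairs in one class: d(PXY) = d(PY) - d(PX), X before Y."""
    return {
        (x, y): dy - dx
        for i, (x, dx) in enumerate(members)
        for y, dy in members[i + 1:]
    }

# Toy equivalence class (assumed data): (item, diffset) pairs.
members = [("a", {4}), ("b", {2, 3}), ("c", {3})]

# Sort the class in descending order of diffset size before pairing.
by_size_desc = sorted(members, key=lambda m: len(m[1]), reverse=True)

unsorted_total = sum(len(d) for d in pairwise_diffsets(members).values())
sorted_total = sum(len(d) for d in pairwise_diffsets(by_size_desc).values())
print(unsorted_total, sorted_total)  # 3 2
```

On this toy class the sorted order stores fewer tids in total, because larger diffsets are subtracted from smaller ones. The example shows the intent of the heuristic and is not a guarantee for every input.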

Postdiffset algorithm
The initial objective of Postdiffset is to handle the issues of big and dynamic data. Postdiffset aims to reduce memory and space requirements by flushing the memory held for each visited itemset before intersecting the next itemsets. Figure 6 depicts the pseudocode for the Postdiffset algorithm. Compared with Postdiffset, the Tidset algorithm is more prone to memory scarcity, and the situation becomes worse when the mining process involves a bigger dataset. As memory utilization increases, the machine's dependency on virtual memory increases. The more a computer has to depend on virtual memory, the less efficiently it runs [14], so its performance is compromised. Hence, to prevent memory scarcity, Postdiffset includes an additional step to flush the memory.

COMPARATIVE ECLAT BASED ALGORITHM
In summary, the review and analysis of Eclat-based algorithms are presented in Table 1. The comparative analysis shows that Eclat-based algorithms are challenged by memory utilization and by processing time, which depends on the size of the data used. The Eclat (tidset) algorithm needs more memory to process the data. dEclat drastically cuts down the memory required, but degrades on sparse databases. Sortdiffset reduces running time and memory usage, at the cost of sorting. Finally, Postdiffset was introduced with the ability to reduce memory utilization; however, its processing time is a disadvantage that needs to be improved in future works.

Table 1. Review and analysis of Eclat-based algorithms

Eclat (tidset) [15,16]
Approach: Vertical intersection of tidlists.
Advantages: The size of a tidset represents the support.
Disadvantages: Pruning is difficult. The longer the tidset, the more time and memory are needed.

dEclat [12]
Approach: Only keeps track of differences in tidsets.
Advantages: Drastically cuts down the size of memory required to store intermediate results. Suitable for a dense database.
Disadvantages: Degrades with a sparse database. Needs to switch between tidset and diffset for a sparse database.

Sortdiffset [17]
Approach: Combines tidset and diffset, then sorts tidsets in ascending order and diffsets in descending order.
Advantages: No need for a switching condition. Reduces running time and memory usage.
Disadvantages: Cost of sorting.

Postdiffset [4]
Approach: Performs tidset for the first-level loop and diffset for the second-level loop, and introduces flushing of memory for the itemset being visited.
Advantages: Better performance. Reduced memory utilization.
Disadvantages: Suffers in processing time.

IMPROVEMENT FOR INCREMENTAL ECLAT ALGORITHM
Previous works use a Relational Database Management System (RDBMS) to store the transaction data obtained in the data mining process. An RDBMS presents information in tables with rows and columns. A table is referred to as a relation in the sense that it is a collection of objects of the same type (rows). Data in a table can be related according to common keys or concepts, and the ability to retrieve related data from a table is the basis for the term relational database [18]. Two types of SQL statement can be used to store data in the database: a single-row INSERT or LOAD DATA INFILE. According to Morel (2017), LOAD DATA INFILE is the preferred solution when looking for raw performance on a single connection [19]. It requires a properly formatted file to be prepared before the LOAD DATA INFILE statement can be executed.
In current algorithms, transaction data is stored in the database using single-row INSERT statements. The statement therefore has to be executed once for each of the n transactions. Figure 7 shows the algorithm for single-row INSERT. To improve on single-row INSERT, the LOAD DATA INFILE statement is used instead. With LOAD DATA INFILE, only one statement is executed to insert the data into the database system. Since only one statement is executed, the memory used for transferring data to the database is minimized. Figure 8 shows how the improved algorithm is executed.
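The difference between the two storage strategies can be sketched in Python as follows. The sketch only builds the statements and the input file; it does not connect to a database. The table name itemsets, its columns itemset and support, and the sample rows are hypothetical, and the %s placeholder style is assumed to match the Python MySQL driver in use.

```python
import csv
import os
import tempfile

def single_row_inserts(rows, table="itemsets"):
    """One parameterised INSERT per row: n rows give n statements."""
    sql = f"INSERT INTO {table} (itemset, support) VALUES (%s, %s)"
    return [(sql, row) for row in rows]

def bulk_load(rows, path, table="itemsets"):
    """Write all rows to a CSV file and return one LOAD DATA statement."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh, lineterminator="\n").writerows(rows)
    return (
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' "
        "LINES TERMINATED BY '\\n' "
        "(itemset, support)"
    )

# Assumed mining output: (itemset, support) pairs.
rows = [("a", 3), ("a c", 2), ("b", 2)]

inserts = single_row_inserts(rows)
csv_path = os.path.join(tempfile.mkdtemp(), "itemsets.csv")
load_stmt = bulk_load(rows, csv_path)

print(len(inserts))  # 3
print(load_stmt)
```

The first strategy executes as many statements as there are rows, whereas the second prepares the file once and executes a single statement, which is the property the proposed improvement relies on.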

Experimental result
The experiment was carried out for the dEclat (Diffset), SortDiffset and Postdiffset algorithms. We focused on processing time and memory usage, comparing the old algorithms with the proposed algorithm as shown in Table 2.
Figure 11. Experimental result for old algorithm using mushroom dataset
Figure 12. Experimental result for proposed algorithm using mushroom dataset

CONCLUSION AND FUTURE WORKS
Our proposed technique shows some improvement in reducing memory usage and processing time. Reducing the number of transactions to the MySQL database reduces memory usage, which also affects the processing time. The experimental results suggest that the number of occurrences of an itemset is one of the factors contributing to algorithm performance. For future work, the technique can also be applied to sparse datasets such as T10I4D100K or retail. Within the same domain, the technique could be applied to mining infrequent itemsets, in order to discover whether the results are the same as for frequent itemsets or differ from those findings.