Investigating the PageRank and sequence prediction based approaches for next page prediction

ABSTRACT


INTRODUCTION
Sequence learning is a significant component of learning in numerous areas of intelligent systems, such as DNA sequencing [1] and web page prediction. With the growth of information and communication, web page prediction has become a meaningful and challenging task. Moreover, sequential pattern mining is a very active research topic, with hundreds of papers presenting new algorithms and applications each year [2]. These techniques have numerous real-life applications, since data is naturally encoded as sequences of symbols in many fields such as bioinformatics, market basket analysis, text analysis, energy reduction in smart homes, web page clickstream analysis, and e-learning [2]. One crucial branch of sequential pattern mining is sequence prediction. Given a set of training sequences, the sequence prediction problem consists of finding the next item of a target sequence by observing only its previous items [3]. Researchers have proposed numerous models to address this problem, using, for example, neural networks, pattern mining, and probabilistic approaches; Markov chains are also prevalently used.
As an essential part of data mining, next page prediction has become increasingly common in the real world. One example is clickstream analysis, in which finding essential items is a significant task and a current hot topic. The principal aim of this article is to investigate various methods that use PageRank and sequence prediction for next page prediction on real clickstream datasets. Within this paper's scope, we focus on techniques related to sequence prediction for next page prediction. This paper is organized as follows. Section 2 introduces the background. Section 3 describes related work. In the next section, we present our proposed approach. In section 5, we present results and discussion. Finally, section 6 concludes.

BACKGROUND
Next page prediction
The next page prediction task consists of predicting the next page of a sequence based on the previously observed pages. For instance, if a user has visited some webpages X, Y, Z, in that order, one may want to predict the next webpage that will be visited by that user. There are two major steps for making the next page prediction, shown below.
Step 1: training sequences → building a sequence prediction model → prediction model.
Step 2: (prediction model, a specific sequence) → prediction algorithm → prediction.
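The two steps above can be sketched in code. This is a minimal illustration only, using a simple first-order Markov (transition-count) model as the prediction model; the function names and the toy data are our own choices, not part of any cited method:

```python
from collections import Counter, defaultdict

def train(sequences):
    """Step 1: build a prediction model from training sequences.
    Here the model simply counts page-to-page transitions."""
    model = defaultdict(Counter)
    for seq in sequences:
        for current_page, next_page in zip(seq, seq[1:]):
            model[current_page][next_page] += 1
    return model

def predict(model, sequence):
    """Step 2: given a specific sequence, predict the next page from
    the last observed page, or None if no prediction can be made."""
    last_page = sequence[-1]
    if last_page not in model:
        return None
    return model[last_page].most_common(1)[0][0]

# Usage: train on clickstreams, then predict the page after X, Y, Z.
training = [["X", "Y", "Z", "W"], ["A", "Z", "W"], ["X", "Y", "Q"]]
model = train(training)
print(predict(model, ["X", "Y", "Z"]))  # most frequent successor of Z
```

Richer models (DG, CPT+, etc., discussed below) differ in how the model is built and queried, but follow the same two-step shape.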

Common sequence prediction models
One of the typical models proposed for webpage prediction is the rules-based model [4][5][6][7][8][9]. The dependency graph (DG) sequence prediction model was proposed in [10]. Its authors used a prediction algorithm patterned after the one proposed by James Griffioen and Randy Appleton [11], who presented a prefetching scheme for the World Wide Web aimed at reducing the latency perceived by users. The prediction algorithm constructs a dependency graph that depicts the pattern of accesses to different files stored at the server [10]. Using the first-order Markov prediction method described in [11] for file prediction, Padmanabhan and Mogul [12] constructed a dependency graph containing nodes for all files ever accessed at a particular WWW server. To predict effectively, [13] estimated conditional probabilities for sequence prediction from n-grams.
The paper [14] describes how conflicts can be resolved with partial string matching and reports experimental results showing that mixed-case English text can be coded in as little as 2.2 bits per character with no prior knowledge of the source. That work used the concept of compressibility, shown to play a role analogous to that of entropy in classical information theory, where one deals with probabilistic ensembles of sequences. The model in [15] makes sequence predictions using a text-compression method; it limits the growth of storage by retaining the most likely prediction contexts and discarding (forgetting) less likely ones. A robust model is CPT, described in [3] as follows.
CPT's advantage is that it compresses the training data so that all relevant information is available for each prediction. It also offers a low time complexity for its training phase and is easily adaptable to different applications and contexts [3]. An improved model of CPT is CPT+, which addresses CPT's weaknesses with three novel strategies that reduce its size and prediction time and increase its accuracy. Experimental results on seven real-life datasets show that CPT+ is up to nearly 100 times more compact and nearly five times faster than CPT, and has better accuracy than other models such as AKOM [13], CPT [3], DG [10], LZ78 [12], PPM [14], and TDAG [15]. This work aims to improve the effectiveness of next page prediction by integrating PageRank with the state-of-the-art CPT+ approach and comparing this proposed approach with others.

Using PageRank for sequence prediction
The research [16] indicates that PageRank has gone from being used to evaluate the importance of web pages to a much broader set of applications. The PageRank algorithm supporting sequence prediction is described as follows. Let SDfull be the original sequence database. After the PageRank computation, the sequence database is arranged into two parts: SDhigh, the set of sequences with a high average PageRank index, and SDlow, the set of sequences with a low average PageRank. The relationship between these data sets is given by formula (1):

SDfull = SDhigh ∪ SDlow, with SDhigh ∩ SDlow = ∅ (1)

Consider SDhigh as a data set containing sequences of the form *PPR, where * is any sequence of pages and PPR is the predictable page (PPR always follows the sequence *). Considering the websites that visit the PPR page: the higher the PageRank index of the PPR page, the more pages will link directly to it (by the nature of PageRank), and so the number of sequences that successfully predict the PPR page will increase. Conversely, when the PageRank of the PPR page is low, fewer pages will go directly to it, and the number of sequences that successfully predict the PPR page will drop. When averaging the PageRank indices over a data sequence, the higher the average, the more sequences of the form *PPR will appear. This also means that more successfully predicted sequences will be added to those predicted by the CPT+ algorithm. Therefore, integrating the PageRank computation into CPT+ is significant for predicting web access.
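The partitioning described above can be sketched as follows. This is a minimal illustration under our own assumptions: the page graph is built from consecutive-page transitions, PageRank is computed by plain power iteration, and the SDhigh/SDlow cutoff is taken at the median average rank (the exact threshold used in the cited work is not specified here):

```python
def pagerank(sequences, damping=0.85, iters=50):
    """Compute PageRank over the page graph whose edges are the
    consecutive-page transitions observed in the sequences."""
    links, pages = {}, set()
    for seq in sequences:
        pages.update(seq)
        for a, b in zip(seq, seq[1:]):
            links.setdefault(a, set()).add(b)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for a, outs in links.items():
            share = damping * rank[a] / len(outs)
            for b in outs:
                new[b] += share
        # dangling pages (no out-links) distribute their rank uniformly
        dangling = damping * sum(rank[p] for p in pages if p not in links) / n
        for p in pages:
            new[p] += dangling
        rank = new
    return rank

def split_by_average_rank(sequences):
    """Partition SDfull into (SDhigh, SDlow) by the average PageRank of
    the pages in each sequence; the median cutoff is our assumption."""
    rank = pagerank(sequences)
    avg = [sum(rank[p] for p in s) / len(s) for s in sequences]
    cutoff = sorted(avg)[(len(avg) - 1) // 2]
    sd_high = [s for s, a in zip(sequences, avg) if a >= cutoff]
    sd_low = [s for s, a in zip(sequences, avg) if a < cutoff]
    return sd_high, sd_low
```

Sequences in SDlow are the candidates for removal, since their pages are rarely reached directly and contribute few *PPR patterns.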

LITERATURE REVIEW
This research [17] aims to predict the user's behaviour using the Apriori prefix tree (PT) algorithm. Using the popularity value of pages, the authors of [18] bias the conventional PageRank algorithm and model a next page prediction system that produces page predictions under a given top-n value. The work [19] introduced an approach for personalized page ranking and recommendation by integrating association mining and PageRank to meet users' search goals. The effectiveness of their proposed method was verified through a few experimental evaluations.
The work [20] proposed a PageRank-like algorithm for conducting web page access prediction, extending the PageRank algorithm for next page prediction with several navigational attributes. The research [21] provides a solution to web page access prediction that aims to increase accuracy and efficiency by reducing the sequence space through the integration of PageRank into CPT+. Moreover, the work [22] proposed a method in which ambiguous prediction problems are resolved using web PageRank and a Markov model; its experimental results show a reduced number of vague predictions after applying the PageRank method. Assuming a set of successive past top-k rankings, the paper [23] introduced a method for predicting the ranking position of a web page by training Markov models on ranking trend sequences. Because the accuracy of low-order Markov models is usually unsatisfactory, the article [24] utilized a popularity- and similarity-based PageRank algorithm to make predictions when ambiguous results are found. The work [25] proposed a PageRank-based algorithm over the web site's graph, and proved through experimentation that their approach yields more accurate and representative predictions than those produced by pure usage-based approaches.
However, these Markov-based models suffer from significant drawbacks: most assume the Markovian hypothesis that each event depends solely on the previous one, and thus prediction accuracy can decrease [26]. Furthermore, both rules-based and Markov-based models do not use all the information contained in training sequences to perform predictions, which can severely reduce their accuracy [26]. The research [27] also surveyed related work on web page access prediction. This review indicates that combining CPT+ [26] with PageRank is a meaningful choice for next page prediction. The scope of this research covers predicting next items effectively in clickstream or web access contexts, which arise in various real-life areas. For instance, the results of this work can be applied to predicting behaviour in e-commerce or predicting users' trends while visiting various websites.

RESEARCH METHOD
In this section, we first use K-fold cross validation to divide each real dataset into ten equal parts and apply the method of [21] to the training datasets. Secondly, we evaluate various sequence-prediction-based methods using [28] on the smaller datasets whose size was shortened by the approach of [21].

Integrating K-fold cross validation method to improve data mining accuracy for web access prediction
The objective of cross-validation is to test the model's ability to predict new data [29]. In particular, the K-fold cross validation method [30] divides the set of observations into K groups of approximately equal size [31]. K is usually chosen as 5 or 10; as K grows, the size difference between the full training set and the subsamples becomes smaller, and as this difference decreases, the bias of the technique decreases [32]. The data is trained and tested K times; at each iteration t, the model is trained on the set D\Dt and tested on Dt (where D is the original data set and Dt is the test fold) [30]. The cross-validation accuracy estimate is the number of correct classifications divided by the number of entities in the original dataset. K-fold cross validation is mainly used to estimate the ability of a machine learning model on unseen data.
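The K-fold splitting described above can be sketched as follows; this is a generic illustration (function name and shuffling seed are our own), not the exact implementation used in the experiments:

```python
import random

def k_fold_splits(data, k=10, seed=42):
    """Yield K (train, test) pairs: each fold Dt serves once as the
    test set while the model is trained on D \\ Dt."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k near-equal groups
    for t in range(k):
        test = folds[t]
        train = [x for i, fold in enumerate(folds) if i != t for x in fold]
        yield train, test

# Usage: with 100,000 sequences and k=10, each iteration trains on
# 90,000 sequences and tests on the remaining 10,000.
```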

Develop training data sets and improve accuracy
Data
We converted the KOSARAK dataset (collected from http://fimi.ua.ac.be/data) into a sequence database (containing 100,000 sequences) in the following format. Each line of the text file represents one sequence from the sequence database. Items within a sequence are separated by a single space and the value -1. The value -2 marks the end of a sequence.
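For concreteness, this format can be parsed as in the following sketch (the helper name parse_spmf and the sample lines are ours):

```python
def parse_spmf(text):
    """Parse the sequence format described above: items separated by
    " -1 ", with "-2" marking the end of each sequence."""
    sequences = []
    for line in text.strip().splitlines():
        items = []
        for token in line.split():
            if token == "-2":          # end-of-sequence marker
                break
            if token != "-1":          # skip item separators
                items.append(int(token))
        sequences.append(items)
    return sequences

sample = "1 -1 2 -1 3 -1 -2\n4 -1 2 -1 -2"
print(parse_spmf(sample))  # [[1, 2, 3], [4, 2]]
```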

Method
We propose a combination of several techniques: K-fold cross validation, the PageRank algorithm on a sequence database, and sequence analysis to predict next pages effectively. The proposed procedure includes 4 main phases, introduced below.
Phase 1: We randomly shuffle all sequences in the considered dataset (the input sequence database). We then split the randomized dataset into 10 equal parts. The first part is used as a testing dataset called DBTest1; the 9 remaining parts form a training dataset called DBTrain1.
Phase 2: We calculate the PageRank value for all sequences in DBTrain1 and, following [21], remove redundant sequences.
Phase 3: We analyse the sequences and use an effective sequence prediction model to predict next pages on the testing dataset (DBTest1). We repeat the above phases for the 9 remaining parts, following K-fold cross validation.
Phase 4: We evaluate the accuracy of the sequence-prediction-based models, analyse the results, and draw conclusions.

Evaluation framework
We used the evaluation framework introduced in [26] to evaluate the prediction models. A prediction can be a success if it is accurate, a failure if it is inaccurate, or a no-match if the model is unable to make a prediction. Coverage is the ratio of sequences without a prediction to the total number of test sequences, and accuracy is the number of successful predictions divided by the total number of test sequences [26].
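The three outcomes and the two measures just defined can be computed as follows; this is a minimal sketch of the framework of [26] under our reading of the definitions above (a None prediction denotes a no-match):

```python
def evaluate(predictions, truths):
    """Classify each prediction as success / failure / no-match, then
    return (accuracy, coverage) as defined in the evaluation framework:
    accuracy = successes / total, coverage = no-matches / total."""
    success = failure = no_match = 0
    for pred, truth in zip(predictions, truths):
        if pred is None:
            no_match += 1
        elif pred == truth:
            success += 1
        else:
            failure += 1
    total = len(truths)
    return success / total, no_match / total
```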

RESULTS AND DISCUSSIONS
After creating 10 data sets according to the above method, we took the ten training sets (each with 90,000 lines) of these 10 data sets and applied the solution for shortening the data series. Using the PageRank algorithm, sequential databases with their corresponding precision are created, as illustrated in Table 1, in which Ri is the accuracy of the collapsed sequential database during the i-th K-fold cross validation. In Table 1, the values 100, 98, 96, down to 58, 56 are the sizes (in per cent) of the collapsed database relative to the training database.
Experimental results show that when applying the PageRank solution to reduce the training data set size gradually by 2%, 4%, 6%, up to 34% (corresponding to compact data sets of 98%, 96%, 94%, down to 66%), accuracy is higher than that of the initial training database. Constructing the compact training sequence database took a long time due to the large data set (100,000 lines) and the large number of nodes in the directed graph (23,496 nodes).
According to the test results illustrated in Figure 1, the average predictive accuracy on the initial training database (of size 90,000) is 99.936%; after removing redundant data series, when the collapsed database reaches a size of 66% (59,400 lines), the average predicted accuracy is 100% (an increase of 0.0621%). Figure 1 shows a chart comparing the average predicted accuracy on the datasets collapsed in size, without loss of predictive accuracy, by PageRank and CPT+.
Note that when the size is reduced to 66%, the peak accuracy of 100% is reached, and accuracy begins to degrade when the size is 62% or less. From these experimental results, we can confirm that using the reduced training data set of size 66% (59,400 lines) for the next stage, the test (prediction) phase, is very feasible. Comparison of web access prediction models integrating PageRank: the empirical results detailed in Table 1 and Figure 2 show that integrating PageRank with CPT+ and DG is suitable, with web access prediction accuracy of approximately 100% for CPT+ and over 80% for DG. In contrast, integrating PageRank with CPT (the older version of CPT+) is not suitable, because the accuracy of web access prediction in this case does not reach 50%.
Furthermore, Figure 2 also shows that integrating PageRank with CPT+ is more effective than all the other methods (DG, Markov1, AKOM, LZ78, CPT) (see Appendices 1 to 6). DG stands in second place, with accuracy rising steadily from 80% at a database size of 100% to 93% at a size of 56% (the reduced sequence database). Markov1 takes third place, with accuracy from 65% to 83%. The figures for AKOM and LZ78 show similar trends. In contrast, the accuracy of CPT is below 50% in most cases, falling steadily from about 48% to about 38% as the size of the sequence database is reduced using the PageRank algorithm. CPT+ is relatively stable, with accuracy of approximately 100% in most cases. Therefore, our proposed approach of integrating PageRank with CPT+ is an effective solution for predicting web access.

CONCLUSION
In this paper, we presented an investigation of sequence prediction for predicting the next page. Experimental results on the real dataset Kosarak show that the accuracy of integrating CPT+ with PageRank is slightly higher than that of the other sequence-prediction-based models. Besides, the size of the shortened sequence database is reduced by nearly 35%. This reduction in the size of the sequence databases benefits the testing (prediction) phase. In particular, redundant sequences are removed from the datasets (sequence databases), and the prediction space shrinks in proportion to the number of removed sequences. Moreover, after removing unnecessary data, the accuracy remains unchanged or even improves. In future work, we are going to develop novel algorithms or continue improving existing algorithms such as CPT+ and PageRank to solve the next page prediction issue more effectively. Apart from that, a solution using big data tools is also being considered, to improve execution time when dealing with this issue.