A novel approach to dynamic profiling of E-customers considering clickstream data and online reviews
Houda Zaim
Int J Elec & Comp Eng, ISSN: 2088-8708

ABSTRACT


INTRODUCTION
The problem of load profiling and classifying customers has been studied extensively. Customer profiling approaches have been used in e-commerce, where systems keep a variety of customer information such as navigational and behavioral patterns and use the K-means algorithm for accurate profile construction [1], [2]. A. D. Rachid et al. [3] proposed a customer prediction model in an e-commerce context in which the clustering phase integrates the k-means method with the Length-Recency-Frequency-Monetary (LRFM) model. Girish S et al. [4] integrated a classifier to predict the type of purchase that a customer would make, as well as the number of visits that he or she would make during a year. User profiles can also be created through multiresolution clustering designed for smart metering data [5]. Smart meter technologies allow retailers to supervise an individual's consumption in real time. Efforts have been made to mine these time series data for user profiling: Y. Wang et al. [6] used data gathered by Advanced Metering Infrastructure to better understand electrical consumption behavior, Q. Wang et al. [7] attempted to automate spike detection within large volumes of smart meter data for load profiling, and Y. Lu et al. [8] investigated the load profile clustering of smart grid customers using an adaptive weighted fuzzy clustering algorithm. In [9], the authors developed a classification system to identify a specific user among several users by investigating different metadata characterizing load profiles, including raw measurements, frequency characterizations and typical load shape indexes. A. E. Frosta et al. [10] tried to reduce the data to be processed during the customer's load profiling. Typical Daily Profiles (TDP) and Typical Weekly Profiles (TWP) are compared to see how the time resolution of data affects the clustering.
In [11], the profile is acquired by explicitly asking users to enter and update their profile manually, and then by monitoring their browsing behavior. The authors of [12] proposed a framework that integrates user profiling and the customer journey, using user interviews and semantic analysis to identify the target audience. Marketing actions are then set up for the customer journey and, finally, for the conversion stage, in which the user is converted into a customer.
Even though these studies reflect customer consumption behavior, they ignore data from the customer's click stream and online review ratings. Moreover, although they use various customer behavior data for creating customer profiles and recommendations, they do not judge or evaluate e-commerce website characteristics based on customer feedback.
For the evaluation of business e-commerce websites, P. Dong [13] established several site evaluation indexes: an information content index, a site survey index and a technical index. T. Hariguna et al. [14] empirically analyzed three antecedents of trust: system quality, information quality, and service quality. Their study concluded that these factors had a positive impact on customer intention to purchase in e-commerce transactions on social media. T. Singh [15] identified factors related to the usability of e-commerce websites: user satisfaction, simplicity, attractiveness, speed, efficiency, and searching for product information. A survey used these factors as valuable input from users for assessing the usability of an e-commerce website. The authors of [16] proposed a framework for evaluating the impact of four parameters on the success of e-commerce: customer satisfaction, costs, awareness & knowledge, and infrastructure. The authors of [17] argued that the success factors for supporting online shopping should be considered from multiple perspectives (a technological perspective such as web design usability, a social perspective such as social networks, etc.) and across different e-commerce life cycle stages (pre-sale, information stage and supply chain).
Various e-commerce website attributes used for data collection are reviewed, keeping in mind the areas where the sequence of web events generated by each user is required to assess online service quality. We therefore discuss the general shopping steps of customers in online stores. H. Sergio et al. [18] analyzed the sections visited by users, the navigational paths followed when accessing specific pages of the website, and the relation between different web sections, or the sections that lead users to buy products. To improve user fulfillment and the shopping experience, it has become general practice for online sellers to allow their users to review or communicate opinions on the products that they have sold. The major goal of the paper [19] was to solve the evaluation feature extraction problem and the opinion classification problem using feature words and opinion words from customers' product reviews. G. Silahtaroglu et al. [20] collected data about customers' mouse movements, their demographic information and the items they added to their shopping baskets. The authors of [21] proposed a global view of sequential online-shopping events, where data sets consist of two parts: a collection of click events and events attached to the corresponding clicks. R. Purwaningsih et al. [22] pointed out that the factors affecting consumer interest in online shopping are Perceived Concentration, Perceived Enjoyment and Perceived Ease of Use.
Identifying the values of these dimensions implicitly from the e-customer click stream is our second task. These data should then be analyzed to load the most accurate customer profile. In the next sections, we present details of the proposed customer profiling approach, which reflects e-commerce website features, customer behavior and feedback; we then mine that dataset and obtain results in the form of Tmall website customer profiles. Data are analyzed through a fuzzy approach to classify customers using a new decision tree algorithm that extends existing approaches [23], [24], [25], [26], [27] to profile generation, taking into account a new set of evaluation variables obtained from customer feedback on the e-commerce website.
The classification task is characterized by well-defined classes and a training set of pre-classified examples. Where the data to be classified is static, a simple classifier might suffice. However, our proposed approach requires a classifier that is resilient to dynamic data such as time-related and click stream data, taking online customer satisfaction levels as predefined classes. The major innovation of the proposed approach is its use of decision tree induction to obtain useful knowledge from large amounts of e-customer data in order to construct online customer profiles. The purpose of our study is to fill this gap through two phases, training and inference. The training phase involves inducing a decision tree ensemble from customer reviews and click stream data. The inference phase uses the decision tree ensemble induced in the training phase to classify customers implicitly, based on their click stream as test data.
After this overview, which brings together two different branches of research (online customer profiling techniques and feature selection based on the customer's shopping steps for load profiles), details of our proposed approach are provided in Section 2, followed by a case study in Section 3 conducted on real user action data provided by Tmall, one of the largest B2C online retail platforms in China. The paper concludes in Section 4 with a summary of the work undertaken and directions for future work.

PROPOSED LOAD PROFILING METHODOLOGY

2.1. Framework for proposed model
The proposed algorithm for customer profile generation has the following steps (Figure 1):
Step 1: Collect web data (click stream and online reviews data) from e-commerce website.
Step 2: Apply preprocessing techniques to remove noise from data.
Step 3: Divide the data into 2 parts: training data (Online Customer Reviews and Click Stream Data) and testing data (Click Stream Data).
Step 4: Apply appropriate data mining techniques to create the model.
Step 5: Train the model as per given rule-based classifier.
Step 6: If the model is not trained, go to Step 5.
Step 7: Test the model using test data.
Step 8: If model is accepted, use it for online customer profiling.
Figure 1. An overview of the proposed methodology
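
The data split of Step 3 can be illustrated with a minimal sketch. It assumes, hypothetically, that click stream rows are (user, item, behavior, timestamp) tuples, that review rows are (user, item, rating) tuples, and that the split is made per user-item pair; the function name and tuple layouts are illustrative and are not taken from the paper's Java implementation.

```python
from collections import defaultdict


def split_by_feedback(click_rows, review_rows):
    """Split click stream rows into a training part and a testing part.

    click_rows:  iterable of (user_id, item_id, behavior, timestamp) tuples.
    review_rows: iterable of (user_id, item_id, rating) tuples.
    Assumption: a (user, item) pair with a review goes to training, labelled
    with its rating; pairs without a review go to testing.
    """
    ratings = {(user, item): rating for user, item, rating in review_rows}
    train, test = defaultdict(list), defaultdict(list)
    for user_id, item_id, behavior, timestamp in click_rows:
        key = (user_id, item_id)
        target = train if key in ratings else test
        target[key].append((behavior, timestamp))
    training = [(key, events, ratings[key]) for key, events in train.items()]
    testing = [(key, events) for key, events in test.items()]
    return training, testing


clicks = [
    ("u1", 1, "click", 10),
    ("u1", 1, "cart", 20),
    ("u2", 2, "click", 5),
]
reviews = [("u1", 1, 5)]
training, testing = split_by_feedback(clicks, reviews)
```

In this toy run, the pair ("u1", 1) has a review and therefore carries a class label for training, while ("u2", 2) has click stream data only and is kept for testing.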

Metadata gathering and processing
User profiling can be defined as the process of identifying data about a user's domain of interest. This information can be used by the system to improve retrieval so that it meets the user's needs. This part presents the dynamics of the proposed model using a communication diagram that shows the interactions between the various components of the system. The goal is to determine graphs of data that will be subject to an appropriate optimization algorithm based on customer behavior targeting.

Capturing dynamic customer's behavior
In Figure

Candidate entry variables
The starting and critical point for successful customer profiling based on navigational data is data preprocessing. The required high-level tasks include data cleaning, user identification, session identification and product view identification. The proposed data preprocessing model for customer analysis tracks the various online activities of customers to construct logical user sessions and create relevant entry variables for our e-customer profiling system, which takes into account a new set of categorical, continuous and fuzzy variables for evaluating online service quality.
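
Session identification, one of the preprocessing tasks listed above, can be sketched as follows. The sketch assumes a common inactivity-timeout heuristic (30 minutes by default); the paper does not state which rule it uses, so the `timeout` parameter and the function name are hypothetical.

```python
def build_sessions(events, timeout=1800):
    """Group one user's (timestamp, behavior) events into logical sessions.

    Assumption: a new session starts whenever the gap to the previous event
    exceeds `timeout` seconds. Timestamps are numbers of seconds.
    """
    sessions = []
    last_ts = None
    for ts, behavior in sorted(events):
        if last_ts is None or ts - last_ts > timeout:
            sessions.append([])  # open a new session
        sessions[-1].append((ts, behavior))
        last_ts = ts
    return sessions


events = [(5000, "click"), (0, "click"), (100, "cart")]
sessions = build_sessions(events)
```

Here the first two events (at 0 s and 100 s) fall into one session and the event at 5000 s opens a second one, since the gap exceeds the assumed timeout.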
After the customer's order is shipped, we use the customer's feedback to assign values to Order Fulfillment, a categorical variable, and Delivery Time, a numeric one. According to Figure 2, there are three other variables, Efficiency, Content and Payment System Availability, which are treated as fuzzy attributes, as shown in Table 4.

Fuzzification
Restrictions placed on the values of fuzzy variables to define their possible changes are based on e-customer behavior during navigation sessions: $S = S_v \cup S_c \cup S_b \cup S_p$, where $S_v$, $S_c$, $S_b$ and $S_p$ are, respectively, visit sessions, consultation sessions, basket sessions and purchase sessions [28].
For the Efficiency variable, the value of -1 indicates that no consultation, basket placement or purchase was made after the visit; efficiency is therefore "Very Weak". The value of 1 indicates that there was a consultation, basket placement or purchase, so the website's efficiency is considered "Very Strong". For the Content variable, the value of 1 indicates that there was a basket placement after the consultation of the product's information, and the value of -1 indicates that there was no basket placement. The payment of an order is very weak if no purchase was made or if the product(s) added to the online basket were deleted; the value of 1 indicates that there was a successful purchase. A membership function (MF) is defined to find the membership value of each fuzzy input in five levels: very weak, weak, medium, strong and very strong. Figure 3 shows the membership function used in our method.
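
As an illustration of the fuzzification step, the sketch below maps a crisp score in [-1, 1] to membership degrees in the five levels. The exact shape of the membership function in Figure 3 is not reproduced here; evenly spaced triangular functions with peaks at -1, -0.5, 0, 0.5 and 1 are an assumption made for this example only.

```python
LEVELS = ["VW", "W", "M", "S", "VS"]
CENTERS = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed peaks, evenly spaced on [-1, 1]
HALF_WIDTH = 0.5                       # assumed half-width of each triangle


def fuzzify(value):
    """Return the membership degree of a crisp score in each of the five levels."""
    value = max(-1.0, min(1.0, value))  # clamp to the attribute scale
    return {
        level: max(0.0, 1.0 - abs(value - center) / HALF_WIDTH)
        for level, center in zip(LEVELS, CENTERS)
    }


memberships = fuzzify(0.25)
```

Under these assumptions a score of 0.25 is half "Medium" and half "Strong", while the extreme scores -1 and 1 belong fully to "Very Weak" and "Very Strong" respectively.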

Proposed data mining approach
The proposed approach consists of two phases, training and inference, as outlined in Figure 4. In the training phase, online customer reviews and click stream data are processed and a decision tree ensemble is derived for change mining. The inference phase uses the classifier derived from the training phase for dynamic profiling of online customers. The decision tree generation algorithm presents our proposed decision tree feature selection strategy, adapted to evolving data.

Learning phase
Given a training set $T$ containing $n$ samples belonging to $k$ classes $\{C_1, C_2, \ldots, C_k\}$ ($1 \le k \le p$). Let $A$ be an attribute that may influence the consumer's satisfaction, with value set $\{a_1, \ldots, a_i, \ldots, a_v\}$, where $v$ is the number of values. The dissimilarity $dis(x_i, x_j)$ between any two values of $A$ for the class $C_k$ is a Euclidean distance:

$$dis(x_i, x_j) = \sqrt{\sum_{l=1}^{m} (x_{il} - x_{jl})^2}$$

where $x_i$ and $x_j$ represent the vectors of belonging to the fuzzy objects, which are represented by functions mapping the fuzzy attribute scale (Figure 3), $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})$. The average similarity of $A$ for the class $C_k$ is expressed by $A_{C_k}$. The global similarity of the attribute is the sum of the similarity of each subset, expressed by $SIM(A)$. The set of class values is $\{C_1, C_2, \ldots, C_k\}$ and their probabilities are $\{P_1, P_2, \ldots, P_k\}$.
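
The Euclidean dissimilarity between two vectors of fuzzy membership degrees can be computed as in the short sketch below. The five-component vectors follow the five linguistic levels of Figure 3; the example values are illustrative and do not come from the Tmall data.

```python
import math


def dis(x_i, x_j):
    """Euclidean distance between two membership vectors of equal length."""
    if len(x_i) != len(x_j):
        raise ValueError("membership vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))


# Membership degrees ordered as (very weak, weak, medium, strong, very strong).
medium_strong = [0.0, 0.0, 0.5, 0.5, 0.0]
very_strong = [0.0, 0.0, 0.0, 0.0, 1.0]
distance = dis(medium_strong, very_strong)
```

Two identical vectors have a dissimilarity of zero, and the value grows as the membership mass moves to different levels.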

Classification of a new point
To classify a new input point, we simply traverse down the tree. At each node in the decision tree, we ask a question about our data point. For categorical and numeric attributes, if the condition at the node is met, we go left; otherwise, we go right. For fuzzy criteria, we calculate the value of $d(x_i, x_j)$, which translates the membership possibility of the new input point in each fuzzy subspace of the tree. The best mark $d(x_i, x_j)$ in a subspace is the one with the minimum value; it identifies the fuzzy split value that satisfies the condition at the node in question.
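
The traversal described above can be sketched as follows. The node classes, attribute names and class labels are hypothetical, and each fuzzy branch is assumed to be represented by one prototype membership vector; this is a simplified reading of the procedure, not the paper's implementation.

```python
import math


class Leaf:
    def __init__(self, label):
        self.label = label


class CrispNode:
    """Test on a categorical or numeric attribute: go left if it holds."""

    def __init__(self, attribute, condition, left, right):
        self.attribute = attribute
        self.condition = condition
        self.left = left
        self.right = right


class FuzzyNode:
    """Test on a fuzzy attribute: follow the closest fuzzy subspace."""

    def __init__(self, attribute, branches):
        self.attribute = attribute
        self.branches = branches  # list of (prototype_vector, subtree)


def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def classify(node, point):
    """Walk down the tree until a leaf is reached and return its class."""
    while not isinstance(node, Leaf):
        value = point[node.attribute]
        if isinstance(node, FuzzyNode):
            # Pick the branch whose prototype has the minimum distance.
            _, node = min(node.branches, key=lambda b: distance(value, b[0]))
        else:
            node = node.left if node.condition(value) else node.right
    return node.label


tree = CrispNode(
    "order_fulfilled",
    lambda v: v == "yes",
    left=FuzzyNode(
        "efficiency",
        [
            ([1.0, 0.0, 0.0, 0.0, 0.0], Leaf("dissatisfied")),
            ([0.0, 0.0, 0.0, 0.0, 1.0], Leaf("satisfied")),
        ],
    ),
    right=Leaf("dissatisfied"),
)
label = classify(tree, {"order_fulfilled": "yes", "efficiency": [0.0, 0.0, 0.0, 0.5, 0.5]})
```

The example point satisfies the crisp test, so it goes left; its efficiency vector is closer to the "very strong" prototype than to the "very weak" one, so it reaches the "satisfied" leaf.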

Dataset
We chose reviews and click stream data from TMALL as the data source. TMALL is an important business unit of Alibaba Group and is known as the top B2C platform in China. Users' browsing behavior on TMALL reflects their preference for items. This data set contains 25432915908 records of user-item interactions. The features of each row are listed in Table 1. Table 2 presents review data for some "user-item" pairs, containing the review and rating of the item/merchant/logistics. This data set contains 241919749 rows, corresponding to 241919749 reviews. Table 4 describes the training data used, where the class is deduced from online reviews filtered by 5 stars, referring to the customer satisfaction rate as mentioned in Table 3. For each class of each customer, restrictions can be placed on the values of website features to define their possible changes based on e-customer behavior during navigation sessions, as mentioned in the Fuzzy Rules Design section. These features are classified into 5 linguistic terms, using click stream data converted into the membership function values of Figure 3, e.g. (0.0/VW Very Weak + 0.0/W Weak + 0.5/M Medium + 0.5/S Strong + 0/VS Very Strong). Customers who did not give feedback were chosen as the testing set, as shown in Table 5.

Table 1. Features of each row
A string such as "u9774184", denoting a unique user
An integer in [1,8133507], denoting a unique item
Type of behavior, a string such as "click", "collect", "cart" or "alipay", representing 'click', 'add to favorite', 'add to cart' and 'purchase', respectively
Vtime: Timestamp of the user's behavior

Table 2. User-Reviews
Item: Definition
Used_id: A string such as "u9774184", denoting a unique user
Item_id: An integer in [1,8133507], denoting a unique item
Feedback: A string containing multiple key words, separated by ' '. These words are extracted from the raw title by an NLP system
Rate_pic_url: A URL linked to the corresponding image online
Gmt_create: Timestamp of the review, a string formatted as "yyyy-mm-dd hh:mm:ss"

Programming setup
The deployment was performed on a system with an Intel Core i7 processor, 8 GB of memory and Windows 7. The method is implemented in Java using Eclipse Neon.3 with CSV files. The proposed algorithm is run on several kinds of customer datasets to estimate the effectiveness of the proposed approach. We ran online TMALL customer reviews and click stream data through step 1 and step 2 of the process outlined in Figure 4. The data shown in Table 4 were randomly chosen as the training set. This data set contains rows corresponding to records of user-item interactions and reviews, with customer satisfaction degree as the target attribute and TMALL website characteristics as non-target attributes.
The simulation results are summarized in Figure 6. The algorithm produces as output a tree that resembles an orientation diagram, where each end node (leaf) is a decision (a class) and each non-final (internal) node represents a test. At each node in the decision tree, we ask a question about the in-store navigational data point considered as input data, as shown in Table 5.

Performance and discussion
Our methodology enhances e-commerce website feature extraction and customer opinion classification. As the results in Figure 6 show, the tests carried out on numeric, categorical and fuzzy features indicate that the proposed algorithm selects an appropriate threshold for stopping growth with respect to the three types of features and gives good classification rates with the shortest computing times. Feature selection not only reduces the dimensionality of the problem but also improves customer classification performance by discarding noisy, redundant, and unimportant features.
First, references such as [29] analyzed the usage of click stream data, but they did not involve fuzzification of those data based on online customer navigation sessions. Moreover, in the area of value creation in an e-business model, such an approach did not attempt to create value for customers based on their navigational pathways. Second, compared with the selection of success features in e-business in reference [30], our research fuzzy-mines click stream data and uses it as a key to select and judge features that properly characterize the global performance of the online store. Third, several references such as [19] studied online products that are relevant in terms of high sales, whereas this paper aims to ensure online customer satisfaction. Using the proposed tree induction technique, marketing rules can be generated to match customers to satisfaction categories. In [31], the classification decision tree algorithm takes as input a training dataset consisting of a number of attributes that are either categorical or continuous. The dataset used in the current paper comes from public information on Tmall. Our approach makes more types of data available for learning a decision tree classifier, which can handle categorical, discrete, continuous and fuzzy attributes.

CONCLUSIONS AND FUTURE WORK
Modeling the user's interests is a challenging task. Our work draws a line between the actual user interest and the acquired user profile by inducing a decision tree ensemble from customer behavior and classifying customers based on their navigation sessions and reviews of the online service that may be used. The case study shows that, by fitting the model variables and their restrictions and by using our classification model, an accurate customer profile can be achieved with regard to the assessment of e-service quality. As future work, our approach could be enhanced and extended in several directions:
a. Whereas explicit customer click data is collected for statistical analysis, textual customer reviews need pre-processing. We can attempt to explore how text mining could be applied to mine and summarize customer reviews. This deserves more study.
b. Compare the proposed customer profiling approach with our proposed multi-criteria classification model [32] to show promising results for the adoption of classification techniques.
c. People tend to seek experienced customers' opinions before making their own purchase. Thus, implicit user profiling through social information discovery can be proposed to adequately capture the user's interests.