Benchmarking data mining approaches for traveler segmentation

Received Aug 7, 2018; Revised Jun 16, 2020; Accepted Jun 29, 2020

This study proposes a hybrid data mining solution for traveler segmentation in the tourism domain, which can be used for planning user-oriented trips, arranging travel campaigns, or providing similar services. The data set used in this work was provided by a travel agency and contains flight and hotel bookings of travelers. Initially, the data set was prepared for running data mining algorithms. Then, various machine learning algorithms were benchmarked on traveler segmentation and prediction tasks. Fuzzy C-means and X-means algorithms were applied for clustering user data. J48 and multilayer perceptron (MLP) algorithms were applied for classifying instances based on the segmented user data. According to the findings of this study, J48 produces the most effective classification results when applied to the data set clustered with the X-means algorithm. The proposed hybrid data mining solution can be used by travel agencies to plan trip campaigns for similar travelers.


INTRODUCTION
Data mining is a technique for extracting knowledge from large data sets. It combines statistical and mathematical methods for processing raw data to discover knowledge [1]. Today, data mining methods are used in many areas, such as filtering systems, risk analysis management, fraud detection, medicine, and e-commerce [2][3][4]. The tourism domain is another area where different types of data mining solutions can be applied. Tour planners, travel schedule planners, and social media-based trip recommenders are examples of data mining related work in this domain [5].
Tour planners suggest possible visit locations to their users. A location-based collaborative filtering approach can be used for this purpose [6]. I. García-Magariño [7] proposed an agent-based tour simulator, whereas A. Varfolomeyev, et al. [8] focused on generating recommendations for historical tourism planning, and R. Colomo-Palacios, et al. [9] proposed a context-aware recommender system for mobile devices. Travel schedule planners help their users build timetables for visit locations by taking time and related constraints into account. F. M. Hsu, et al. [10] combined the Engel-Blackwell-Miniard model with a Bayesian network. A. Moreno, et al. [11] proposed an ontology-based recommendation system. Trip planning and scheduling were combined in [12]. Content-based filtering and hotel service recommenders were proposed in [13,14].
Social media-based trip recommenders propose items based on information retrieved from sources such as geo-tagged photos of travelers [15]. Y. Sun, et al. [16] used geo-tagged image data for road-based recommendations. Trip behaviors of users were extracted from geo-tagged photos in [17], and tourist hot spots were identified in [18]. A. Majid and J. Han [19,20] proposed similar approaches for obtaining and personalizing travel locations. To carry out these tasks, a reliable data mining framework is required [21][22][23].
An expert system designed for the tasks above mostly relies on a recommender engine. Basically, a recommender engine tries to propose similar items to a target user or user group [24]. To achieve this goal, the system generates a rating score, and possible items are matched with users based on that score. The rating score can be computed in different ways. Most systems use the target user's profile and previous user behavior data for this task.
In this study, selected data mining methods were tested and benchmarked on a traveler data set in order to propose a hybrid data mining approach for obtaining accurate travel recommendations. Neural network-based and tree-based data mining methods were combined with Fuzzy C-means and X-means clustering algorithms to identify the model pair that generates the highest prediction correctness. The rest of the paper is organized as follows: Section 2 describes the data gathering process and gives a background of the data mining algorithms used in this study. Section 3 presents the obtained results, and Section 4 concludes the study.

RESEARCH METHOD
Different types of machine learning algorithms were tested on a real-world traveler data set which contains information about flight and hotel booking transactions of travelers. The data set and the applied algorithms are described in detail in the following subsections.

Data gathering and processing
The raw data set was retrieved from a travel agency. It included transactions of 26,886 flight bookings and 4,367 hotel bookings. After finding flight and hotel records that customers booked for the same trip, 317 matching records were collected. Removing the identity columns from this data set yielded 14 attributes. Table 1 lists these initial data set attributes and their descriptions.
Further data set analysis revealed that the "departure location" and "returning location (to)" attributes contained the same set of values. For this reason, "departure location" was removed from the data set. Values of the "returning location (from/to)" attributes were discretized according to regions. Table 2 lists the possible regions and their numeric codes. Values of the flight and hotel cost attributes were discretized into six groups according to customers' expenses. Table 3 and Table 4 show the discretized cost groups. The "departure date" and "returning date" attributes were used for computing each transaction's travel season and travel duration values. The "days in hotel" attribute was removed because its values were identical to the travel duration. The ticket class attributes were also removed because 97% of the ticket class values belonged to the same ticket type.
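The derivation of the two date-based attributes can be illustrated with a minimal Python sketch. The numeric season codes, the month-to-season mapping, and the choice of the departure date as the reference for the season are all assumptions made for illustration; the source does not state them. The codes are chosen so that values above 2 correspond to summer or fall.

```python
from datetime import date

# Hypothetical season codes (not given in the source).
SEASON_CODES = {"winter": 1, "spring": 2, "summer": 3, "fall": 4}


def season_of(d: date) -> str:
    """Map a date to a northern-hemisphere meteorological season (assumed)."""
    if d.month in (12, 1, 2):
        return "winter"
    if d.month in (3, 4, 5):
        return "spring"
    if d.month in (6, 7, 8):
        return "summer"
    return "fall"


def derive_trip_features(departure: date, returning: date) -> dict:
    """Derive travel season (from the departure date) and duration in days."""
    if returning < departure:
        raise ValueError("returning date precedes departure date")
    return {
        "travel_season": SEASON_CODES[season_of(departure)],
        "travel_duration": (returning - departure).days,
    }


features = derive_trip_features(date(2019, 7, 10), date(2019, 7, 17))
print(features)
```

A one-week trip departing in July is coded here as season 3 with a duration of 7 days.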
After deriving two new attributes (travel season and travel duration), removing redundant fields, and discretizing the data set, 10 attributes remained as input for the data mining algorithms. Table 5 lists the final data set attributes and their descriptions. The final data set was used for training and testing the data mining models: 66% of the data was used for training and 34% for testing.
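The 66%/34% split can be sketched as follows. This is only an illustration: the shuffling, the seed, and the rounding of the training set size are assumptions, and the sketch assumes that all 317 matched records remain in the final data set.

```python
import random


def percentage_split(records, train_fraction=0.66, seed=42):
    """Shuffle the records and split them into training and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Record indices stand in for the actual traveler instances.
train, test = percentage_split(range(317))
print(len(train), len(test))  # prints "209 108"
```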

Clustering and classification algorithms
Various clustering and classification algorithms were executed to build prediction models using the described traveler data set. Brief descriptions of these approaches are listed below:
a) Multilayer perceptron (MLP): MLP is a classification algorithm based on feed-forward artificial neural network models. It employs backpropagation for training the network [1].
b) J48: J48 is the Java implementation of the C4.5 decision tree algorithm, which is based on ID3. This approach uses information entropy while constructing the decision tree model [1,25].
c) Fuzzy C-means clustering (FCM): FCM is a soft clustering algorithm. Unlike in hard clustering methods, each point in a data set has a degree of belonging to every cluster [26,27].
d) X-means clustering (XM): XM can be summarized as an improved version of the K-means clustering algorithm. It estimates the number of clusters for a given data set by itself [28].
The prediction models were compared using the following metrics. Correctness is the percentage of correctly classified instances. Root mean squared error (RMSE) measures the differences between actual and predicted values.
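The two metrics defined above can be sketched in Python as follows. The function names are hypothetical, and the RMSE function uses the generic definition over paired numeric values; how the tools used in this study derive the numeric values for a classifier (for example, from class probabilities) is not stated in the source.

```python
import math


def correctness(actual, predicted):
    """Percentage of instances whose predicted label equals the actual label."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between paired numeric values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# Three of four cluster labels are predicted correctly.
print(correctness([1, 2, 3, 3], [1, 2, 3, 1]))  # prints 75.0
```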

Model training
WEKA [29] and MATLAB [30] were used for running the clustering and classification algorithms. The comparison metrics described above were computed for each prediction model, and the obtained results are discussed in the next section.

RESULTS AND ANALYSIS
The final version of the traveler data set was segmented into four to eight clusters using the X-means and Fuzzy C-means clustering algorithms. This process yielded ten differently segmented versions of the same data set. J48 and MLP prediction models were then generated for each segmented version. The most accurate of these models can be used to assign a new traveler instance to its segment. Table 6 lists the recall, specificity, precision, correctness, and RMSE values of each prediction model.
According to the experimental results shown in Table 6, J48 has the best correctness, precision, and recall scores when it is applied to the data set clustered into five clusters using the X-means algorithm. MLP generates the highest specificity and the lowest RMSE values when it is applied to the data set clustered into eight clusters using the X-means algorithm. The best score for each metric was therefore obtained on a data set clustered with X-means. Table 7 shows the decision tree paths for the J48 and X-means combination, which has the top correctness score.
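The experimental grid described above can be enumerated with a short Python sketch. It only counts the combinations; the total of 20 prediction models is inferred from pairing both classifiers with each of the ten segmentations and is not stated explicitly in the source.

```python
from itertools import product

CLUSTERERS = ("X-means", "Fuzzy C-means")
CLUSTER_COUNTS = range(4, 9)  # four to eight clusters
CLASSIFIERS = ("J48", "MLP")

# One segmented data set per (clustering algorithm, cluster count) pair.
segmentations = list(product(CLUSTERERS, CLUSTER_COUNTS))
# One prediction model per (segmentation, classifier) pair.
models = list(product(segmentations, CLASSIFIERS))

print(len(segmentations), len(models))  # prints "10 20"
```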
According to the results listed in Table 7, the J48 model generated 11 different tree paths. Each path can be mapped to a decision rule for a specific type of customer. Based on the listed paths, the characteristics of each cluster can be defined as follows:
1) Cluster 1 represents male or female passengers whose preferred returning location is within location codes from 1 to 14 and preferred returning airline is within company codes from 10 to 77.
2) Cluster 2 represents four different types of passengers:
   a. Male or female passengers whose preferred returning location is within location codes from 15 to 22 and hotel cost is above 1000 TL.
   b. Male passengers whose preferred returning location is within location codes from 15 to 22 and hotel cost is between 701 TL and 1000 TL. These passengers prefer travelling in summer or fall seasons.
   c. Male passengers whose preferred returning location is within location codes from 15 to 22 and hotel cost is no more than 700 TL.
   d. Female passengers whose preferred returning location is within location codes from 15 to 22 and hotel cost is between 351 TL and 1000 TL.
3) Cluster 3 represents three different types of passengers:
   a. Male passengers whose preferred returning location is within location codes from 15 to 22 and hotel cost is between 701 TL and 1000 TL. These passengers prefer travelling in spring or winter seasons.
   b. Female passengers whose preferred returning location is within location codes from 15 to 22 and hotel cost is no more than 350 TL.
   c. Male or female passengers whose preferred returning location is within location codes from 1 to 14 and preferred returning airline is within company codes from 1 to 9 and preferred departure airline is any company other than the company with code 1 and hotel cost is no more than 700 TL. These passengers prefer travelling in summer or fall seasons.
4) Cluster 4 represents male or female passengers whose preferred returning location is within location codes from 1 to 14 and preferred airline is within company codes from 1 to 9 and hotel cost is above 700 TL. These passengers prefer travelling in summer or fall seasons.
5) Cluster 5 represents two different types of passengers:
   a. Male or female passengers whose preferred returning location is within location codes from 1 to 14 and preferred returning airline is within company codes from 1 to 9 and preferred departure airline is the company with code 1 and hotel cost is no more than 700 TL. These passengers prefer travelling in summer or fall seasons.
   b. Male or female passengers whose preferred returning location is within location codes from 1 to 14 and preferred returning airline is within company codes from 1 to 9. These passengers prefer travelling in winter or spring seasons.

Table 7 (excerpt). Decision tree paths for the J48 and X-means combination
Path 1: If "Returning location (from)" > 14 and "Hotel Cost" > 3 Then output is Cluster 2
Path 2: If "Returning location (from)" > 14 and "Hotel Cost" <= 3 and "Gender" > 0 and "Hotel Cost" > 2 and "Season" > 2 Then output is Cluster 2
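The two rule fragments that survive from Table 7 can be expressed as a small Python function. This is a sketch covering only those two paths, not the full 11-path tree; the function and parameter names are hypothetical, and all inputs are the discretized numeric codes. Comparing the rules with the cluster descriptions suggests that a gender code above 0 denotes male passengers and a season code above 2 denotes summer or fall, but this reading is an inference rather than something the source states.

```python
def classify_by_listed_paths(returning_from, hotel_cost, gender, season):
    """Return the cluster for the two quoted paths, or None if neither applies."""
    # Path 1: returning location code above 14 and hotel cost group above 3.
    if returning_from > 14 and hotel_cost > 3:
        return 2
    # Path 2: same location range, hotel cost group 3, gender code above 0,
    # season code above 2.
    if (returning_from > 14 and hotel_cost <= 3 and gender > 0
            and hotel_cost > 2 and season > 2):
        return 2
    # The remaining nine paths are not reproduced in the text.
    return None


print(classify_by_listed_paths(returning_from=16, hotel_cost=4, gender=0, season=1))
```

The call above satisfies the first path, so it prints 2. Inputs that match neither quoted path return None instead of a guessed cluster.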

CONCLUSION
This study presents and compares the detailed model performances of different data mining algorithms executed on a real-world traveler data set. Based on the obtained results, the combination of the J48 and X-means algorithms has the best prediction performance in terms of the given classification metrics. This hybrid data mining method can be used for predicting possible trip destinations based on the behaviors of similar users. The prediction result can support the decision-making process of travel agencies while they prepare campaigns. Alternatively, it can be part of a travel system in which possible trip opportunities are proposed to similar users. Including more classification and clustering algorithms in this approach is left for a future study.

Dr. Adem Karahoca holds a PhD in Software Engineering. He is interested in human-computer interaction, web-based education systems, data mining, big data, and management information systems. He has published articles in prestigious journals about the use and data mining applications of business information systems in health, tourism, and education.