Vertical intent prediction approach based on Doc2vec and convolutional neural networks for improving vertical selection in aggregated search

Received Jul 18, 2019; Revised Jan 13, 2020; Accepted Feb 1, 2020

Vertical selection is the task of selecting the most relevant verticals for a given query in order to improve the diversity and quality of web search results. This task requires not only predicting relevant verticals but also ensuring that these verticals are the ones the user expects to be relevant for his particular information need. Most existing works focused on using traditional machine learning techniques to combine multiple types of features for selecting several relevant verticals. Although these techniques are very efficient, handling vertical selection with high accuracy is still a challenging research task. In this paper, we propose an approach for improving vertical selection in order to satisfy the user vertical intent and reduce the user's browsing time and effort. First, it generates query embedding vectors using the doc2vec algorithm, which preserves syntactic and semantic information within each query. Second, these vectors are used as input to a convolutional neural network model that enriches the representation of the query with multiple levels of abstraction, including rich semantic information, and then creates a global summarization of the query features. We demonstrate the effectiveness of our approach through comprehensive experimentation on various datasets. Our experimental findings show that our system achieves an accuracy of up to 80% and makes accurate predictions on new, unseen data.


INTRODUCTION
One of the most significant developments online in the last few years is the rising popularity of aggregated search (AS) systems, one of the most popular web search presentation paradigms used by major search engines. This technique consists of integrating search results from a variety of diverse verticals, such as news, images, videos, health, and Wikipedia, into a single interface together with general web search results, as shown in Figure 1 [1]. Research in aggregated search has taken two main directions. The first studies methods for predicting which verticals to present, known as vertical selection (VS); the second involves techniques for presenting these verticals within the web results, known as vertical presentation (VP). The research problem investigated in this paper focuses on the first one.
The vertical selection task consists of selecting a subset of the most relevant verticals for a given user information need, improving search effectiveness while reducing the load of querying a large set of verticals. The main goal behind this task is to help the user satisfy his information need. Identifying the intent behind the user query is a crucial step toward reaching this goal. In a broad sense, the automatic prediction of user intentions helps to enhance the user experience by returning more relevant results and adapting those results to users' specific needs. Thus, vertical selection is associated with two main challenges: the diversity of the verticals and the understanding of the user intent.

Figure 1. Aggregated search process [1]

Regarding the first challenge, a variety of heterogeneous vertical search engines exist on the web, and each vertical has its own features. For example, some verticals, like the weather vertical, are not directly searchable by users, so features generated from the vertical query-log are not available for such verticals. Likewise, features generated from a vertical corpus are not available for verticals such as the calculator and language translation [2]. Therefore, researchers must deal with the fact that different verticals may require different feature representations when creating new vertical selection approaches.
The second challenge is related to the user intent. A vertical selection system focuses on retrieving relevant verticals and showing their results to the user. For example, if the user searches the images and news verticals, he specifically needs the results of these verticals and is not interested in the content of other verticals such as shopping or weather, even if their content is relevant, because the goal is not only to return relevant results but also to satisfy the user intent.
Existing research papers have studied the problem of vertical selection in different ways [3]. Some prior work focused on constructing models that aim at detecting queries with intent toward a specific vertical such as shopping [4], news [5], jobs [4] or question answering [6]. Other works on vertical selection [7][8][9][10] focused on using traditional machine learning techniques to combine multiple types of features for selecting several relevant verticals. Although these techniques are very efficient, handling vertical selection with high accuracy is still a challenging research task.
In this paper, we are interested in textual queries (image queries tend to have image results). Our proposed approach for predicting vertical intent consists of generating query embedding vectors using the doc2vec algorithm, which accurately preserves syntactic and semantic information within each query; we therefore use it as the primary query representation in our vertical selection pipeline. Each vector is then used as input to a convolutional neural network (CNN) model that enriches this representation with multiple levels of abstraction comprising rich semantic information and creates a global summarization of the query features. To the best of our knowledge, this is the first time the benefits of deep learning and paragraph vectors have been exploited in the context of vertical selection, which we believe can advance research in this area.
The remainder of this paper is structured as follows. In the next section, we review the related work concerning vertical selection. In Section 3, we provide a description of our proposed method. Section 4 is devoted to the experimental settings. We present and discuss the experimental results in Section 5. Finally, we conclude the study and discuss future work.

RELATED WORK
Features used in previous vertical selection works can be grouped into Query features (features that depend only on the query) [6][7][8][9], Vertical features (features that depend only on the vertical) [2, 5, 21, 22] and Vertical-Query features (features that aim to measure relationships between the vertical and the query, and are therefore unique to the vertical-query pair) [2, 7, 22-24]. Indeed, among all these works, there are those that integrate the content from a single vertical [3]. In this respect, Li et al. [4] address the general problem of vertical selection using an approach that focuses on the shopping and job verticals to extract implicit feedback using semi-supervised learning based on clickthrough data. Diaz [5] also investigated the vertical selection problem with respect to the news vertical, deriving features from the news collection, web and vertical query-logs and incorporating click-feedback into the model. More recent work in [6] has also targeted a variant of the vertical selection problem, where the authors use Community Question Answering (CQA) verticals for detecting queries with CQA intent.
Other approaches have been developed in which several verticals are considered simultaneously [3]. Arguello et al. [7] propose a classification-based approach for vertical selection in which they exploit features from the vertical content, the query string, and the vertical's query log. The click-through data is used to construct a descriptive language model for each vertical's related queries. Diaz and Arguello [9] also present several algorithms for combining user feedback with offline classifier information; the focus of their work was to maximize user satisfaction by presenting the appropriate vertical display. In another work, Arguello et al. [8] use training data associated with a set of existing verticals to learn a model that can make vertical selection predictions for a target vertical. Recent work in the same context was proposed in [10], in which the desired vertical of the user is placed at the top of the web result page. This is achieved by predicting verticals based on the user's past behavior.
Regarding the second factor, there are few works addressing the vertical intent issue when studying query intent detection. From the user intent perspective, user vertical intent also plays an important role in improving the aggregated search process. For example, Zhou et al. [25] propose a methodology to predict the vertical intent of a query using a search engine log by exploiting click-through data. Recent work by Tsur et al. [6] presents a supervised classification scheme aimed at detecting queries with question intent as a variant of the vertical selection problem. They introduced two classification schemes that consider query structure. In the first approach, they induce features from the query structure as input to supervised linear classification. In the second approach, word clusters and their positions in the query are used as input to a random forest classifier to identify discriminative structural elements in the query.
Despite its interesting role, research in this direction is still limited, and the literature on vertical selection based on user vertical intent remains sparse, especially given the evolution of information retrieval (IR) in recent years. Therefore, we focus in this work on the problem of vertical intent prediction, where we propose a new approach that combines the doc2vec algorithm and convolutional neural networks, exploiting the benefits of both techniques for the first time to improve the vertical selection task.

PROPOSED APPROACH
Through the vertical selection process, the query is processed and sent to multiple verticals as well as the web search engine. Deciding which of those should be selected for a given query depends on which verticals the user intends to retrieve; we refer to this as the user vertical intent (VI), defined as follows. Given a user's query $q$ and a set of candidate verticals $V = \{v_1, v_2, \ldots, v_n\}$, the vertical intent is represented by the vector $I = \{i_1, i_2, \ldots, i_n\}$, where each value $i_k$ indicates the importance of the vertical $v_k$ to the query $q$. For each vertical $v_k$, given a threshold $\tau$ above which the vertical is assumed to have a high intent for the query: if $i_k > \tau$, then the vertical $v_k$ is intended by the query $q$. To address this problem, we propose an approach based on doc2vec and convolutional neural networks. First, it generates a query embedding vector using the doc2vec algorithm, which preserves syntactic and semantic information within each query. Second, this vector is used as input to a convolutional neural network model that enriches the representation of the query with multiple levels of abstraction, including rich semantic information, and then creates a global summarization of the query features. Figure 2 shows the overall architecture of our proposed vertical selection system. It contains two main parts: the semantic representation of the query and the query-level feature extraction. The subsections below describe these two parts and all performed steps in depth.
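To make the decision rule concrete, the following minimal Python sketch applies the threshold rule defined above. The vertical names, the example scores and the helper `select_verticals` are illustrative assumptions, not part of our actual implementation.

```python
# Hypothetical sketch of the vertical-intent decision rule: a vertical v_k is
# selected when its intent score i_k exceeds the threshold tau. All names and
# values below are illustrative.
from typing import Dict, List

THRESHOLD: float = 0.5  # tau: minimum intent score for a vertical to be selected


def select_verticals(intent_scores: Dict[str, float],
                     threshold: float = THRESHOLD) -> List[str]:
    """Return the verticals whose intent score i_k exceeds the threshold tau."""
    return [v for v, score in intent_scores.items() if score > threshold]


# Example: scores as a model might output them for the query "iPhone photo"
scores = {"news": 0.05, "images": 0.82, "videos": 0.31,
          "shopping": 0.12, "weather": 0.01}
print(select_verticals(scores))  # ['images']
```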

Semantic representation of the query
The core algorithm of this step is doc2vec, an unsupervised model most commonly used to construct distributed representations of arbitrarily long sentences. It is an extension of word2vec that learns fixed-length feature representations for variable-length pieces of text such as sentences, paragraphs, and documents [26]. Doc2vec, or paragraph vectors, has two different architectures: the Distributed Bag-of-Words model and the Distributed Memory model. The Distributed Bag-of-Words (DBOW) model trains faster and does not consider word order; it predicts a random group of words in a paragraph based on the provided paragraph vector. The Distributed Memory (DM) model treats the paragraph as an extra word, which is then averaged with the local relevant word vectors when making predictions. This method requires additional computation but can achieve better results than DBOW. We chose the doc2vec model because it overcomes the disadvantages of other bag-of-words models by learning semantic relationships between words; this is why it has been widely used recently in various NLP tasks [27][28][29] and information retrieval works [30][31][32][33][34][35], where it has proven able to capture the semantics of paragraphs, leading to excellent results. In this step, the goal is to learn a good semantic representation of the input query. Following that idea, we represent each query as a vector and use this vector as features for our classification model, as presented in Figure 2. The doc2vec algorithm thus contributes effectively to improving the performance of our system.
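As an illustration, the short gensim sketch below instantiates the two paragraph-vector architectures via the `dm` flag. The toy corpus and hyperparameter values are placeholders, not the settings used in our experiments (those are given in Table 2).

```python
# Minimal sketch of the two paragraph-vector architectures in gensim.
# The corpus and hyperparameter values here are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

queries = ["harry potter movie", "jaguar car price", "lose weight exercise"]
corpus = [TaggedDocument(words=q.split(), tags=[i])
          for i, q in enumerate(queries)]

# Distributed Bag-of-Words (DBOW): faster, ignores word order (dm=0)
dbow = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=40)

# Distributed Memory (DM): treats the paragraph as an extra word (dm=1)
dm = Doc2Vec(corpus, dm=1, vector_size=100, window=2, min_count=1, epochs=40)

# Both models can infer a fixed-length vector for a new, unseen query
vector = dm.infer_vector("harry potter book".split())
print(vector.shape)  # (100,)
```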

Query level feature extraction
Unlike traditional approaches that require handcrafted features to predict relevant verticals, our approach uses a convolutional neural network to extract the most important semantic features representing each query and discard those that are unnecessary. Recently, CNNs have achieved promising performance in various NLP tasks, such as information extraction [36], summarization [37], machine translation [38], classification [39], question answering [40] and other traditional NLP tasks [41][42][43]. The CNN architecture used in this paper is shown in Figure 3.
- Input layer: once the query embedding vectors are generated in the previous step, we use them to construct the input matrix needed in the embedding layer, where each query is represented as a 2-dimensional matrix.
- Convolutional layer: the primary purpose of this layer is to capture the syntactic and semantic features of the entire query and compress these valuable semantics into feature maps. Thus, we perform several convolutions over the embedded query vectors using multiple filters with different window sizes. As each filter moves over the input, various features are produced and combined into a feature map. Activation functions are added to incorporate element-wise non-linearity.
- Max-pooling layer: in this layer, the goal is to extract the most relevant features within each feature map; therefore, we use the max-pooling strategy for the pooling operation. Since there are multiple feature maps, we obtain a vector after each pooling operation. All vectors obtained from the max-pooling layer are concatenated into a fixed-length feature vector.
- Fully connected layers: these layers constitute the classification part of our model; their main purpose is to take the high-level features obtained from the previous layers and pass them to the final softmax layer, which classifies the input query into various classes based on its vertical intent scores.
In addition to these layers, we performed some optimizations to reduce overfitting and obtain better test accuracy, illustrated in the sketch below. These include applying an l2 norm constraint on the weight vectors in the convolutional and fully connected layers, as well as adding several batch normalization layers that normalize the activations of the previous layer for each batch during training, which helps train our model faster and consequently improves its performance.
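The following Keras sketch shows one possible realization of this architecture. The layer sizes, the number of verticals (N_VERTICALS) and the input shape are illustrative assumptions rather than our exact configuration; our actual hyperparameters are reported in the experiments section.

```python
# A minimal Keras sketch of the CNN described above. Sizes are illustrative.
from tensorflow.keras import layers, regularizers, Model, Input

SEQ_LEN, EMB_DIM, N_VERTICALS = 10, 100, 5
FILTER_SIZES, N_FILTERS = (2, 3, 4), 100

inp = Input(shape=(SEQ_LEN, EMB_DIM))  # query as a 2-D embedding matrix
pooled = []
for size in FILTER_SIZES:
    # Convolutions with different window sizes capture n-gram-like features
    conv = layers.Conv1D(N_FILTERS, size, activation="relu",
                         kernel_regularizer=regularizers.l2(0.01))(inp)
    # Max-pooling keeps the strongest feature from each feature map
    pooled.append(layers.GlobalMaxPooling1D()(conv))

x = layers.Concatenate()(pooled)        # fixed-length feature vector
x = layers.BatchNormalization()(x)      # normalize activations per batch
x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.01))(x)
out = layers.Dense(N_VERTICALS, activation="softmax")(x)  # intent scores

model = Model(inp, out)
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
model.summary()
```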

EXPERIMENTS
This section describes the datasets used and the various hyperparameters chosen for evaluating the performance of our system. It also gives details about how the doc2vec and CNN models were trained.

Datasets
In these experiments, we employ three public datasets for training, validating and testing our proposed system, respectively. Summary statistics of these datasets are listed in Table 1. We describe each dataset in detail below:

For the first dataset, we use the official NTCIR-12 IMine-2 vertical intent collection for English subtopics [44], which was designed to explore and evaluate technologies for understanding user intents behind queries. IMine-2 includes a set of 100 topics (i.e., queries). Each topic is labeled with a set of intents with probabilities, and there is a set of subtopics serving as relevance judgments for each vertical intent. A subtopic of a given query is viewed as a search intent that specializes and/or disambiguates the original query; the number of these subtopics is 533. The general idea is to first generate a more complete representation of the different possible intents associated with the input query, and then to perform vertical selection for each intent separately.
In addition, this dataset includes five types of queries, namely "ambiguous", "faceted", "very clear", "task-oriented", and "vertical-oriented", which allows us to investigate the performance of our system with diverse queries and varied topics. The details of the five query types are as follows [44]:
a. Ambiguous: the concepts/objects behind the query are ambiguous (e.g., "jaguar" -> car, animal, etc.).
b. Faceted: the information need behind the query includes many facets or aspects (e.g., "harry potter" -> movie, book, Wikipedia, etc.).
c. Very clear: the information need behind the query is so clear that usually a single relevant document can satisfy it (e.g., "apple.com homepage").
d. Task-oriented: the search intent behind the query relates to the searcher's goal (e.g., "lose weight" -> exercise, healthy food, medicine, etc.).
e. Vertical-oriented: the search intent behind the query strongly indicates a specific vertical (e.g., "iPhone photo" -> image vertical).
We use this dataset as training data, from which we construct a validation set by randomly selecting 10%.
The second dataset used in this paper is FedWeb'14 [45], used in the TREC FedWeb track 2014. The collection contains search result pages from 108 web search engines (e.g., Google, Yahoo!, YouTube and Wikipedia). For each engine, 75 test topics were provided, of which 50 are used for vertical selection evaluations. We conduct a prediction task for these 50 test queries and compare the results obtained with the real values in the dataset.
The last dataset is the TREC 2009 Million Query Track [46]. This collection contains 40,000 queries that were sampled from two large query logs. To help anonymize them and select queries with relatively high volume, they were processed by a filter that converted them into queries with roughly equal frequency in a third query log. This dataset is used as training data for the doc2vec algorithm to improve its performance with large data. Moreover, the queries used in these experiments have different lengths, from short to long, as shown in Figures 4, 5 and 6; the structure of our model allows us to learn both kinds.

Table 1 (summary): FedWeb 2014 [45] is designed for resource selection, results merging and vertical selection tasks, and is used here for testing; TREC Million Query Track 2009 [46] is an exploration of ad hoc retrieval over a large set of queries and a large collection of documents, investigating questions of system evaluation.

Hyperparameters and training
First, to train and evaluate our proposed system, we need to adjust several hyperparameters. Regarding the first part of our architecture, to use doc2vec on our datasets we first trained a doc2vec model (the Python gensim implementation) on the TREC 2009 Million Query Track dataset using the Distributed Memory model. Then we transformed all the queries in both the training and testing sets into doc2vec vectors. The various parameters used for training the doc2vec model are shown in Table 2.
For the second part, our network is implemented using the Keras framework with a TensorFlow backend. For the input layer, we used Gensim's doc2vec embeddings and created the input data from them, instead of using a Keras embedding layer. To build our CNN architecture there are many hyperparameters to choose from. Therefore, we first consider the performance of a baseline CNN configuration using the parameters described in Table 3. Then, we evaluated the effect of each of the other parameters by holding all other settings constant and varying only the factor of interest, as sketched below. During these experiments, we chose the ReLU activation function and the max-pooling strategy for our CNN model with the mean squared error loss function; these choices are commonly used in CNN architectures and give good performance. Finally, we combined all the good variation results obtained from these experiments and used them for our suggested CNN model.
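The sketch below illustrates this one-factor-at-a-time procedure. The baseline values shown and the `build_and_evaluate` helper are hypothetical, standing in for the configuration of Table 3 and our training routine.

```python
# Hedged sketch of the one-factor-at-a-time evaluation: start from a baseline
# configuration and vary a single hyperparameter while keeping the others
# fixed. The baseline values are illustrative, not those of Table 3.
baseline = {"filter_sizes": (3, 4, 5), "n_filters": 100,
            "dropout": 0.5, "l2": 0.01, "batch_size": 50,
            "optimizer": "adam"}


def sweep(param, values, evaluate):
    """Evaluate configurations that differ from the baseline in one factor."""
    results = {}
    for v in values:
        config = dict(baseline, **{param: v})  # vary only the factor of interest
        results[v] = evaluate(config)
    return results


# e.g. sweep("dropout", [0.1, 0.2, 0.3, 0.4, 0.5], build_and_evaluate)
# where build_and_evaluate is a hypothetical helper that trains the CNN
# with the given config and returns validation accuracy.
```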

RESULTS AND DISCUSSION
In this section, we present and discuss our experimental results obtained by each part of our vertical selection system.

Semantic representation of the query using Doc2vec
Starting with the first part of our proposed system, once the doc2vec model is trained, we use it to generate an embedding vector for each query in our collections (training, validation and test sets). These embedding vectors must capture the semantic meaning of each query. To verify that the model achieves this goal, we used it to find the most similar queries for two sample queries in our dataset; the resulting queries and corresponding scores are presented in Table 4. As the table shows, the similarity scores between the first queries and the second ones are significant. We conclude that this doc2vec model is meaningful and can successfully recognize semantic information among queries.
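The following minimal sketch shows how such a similarity check can be performed with gensim (version 4 or later). The toy queries are illustrative, not the actual samples of Table 4.

```python
# Sketch of the semantic-similarity check: infer a vector for a new query and
# look up the most similar training queries by cosine similarity. The tiny
# corpus here is illustrative only.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

queries = ["harry potter movie", "harry potter book",
           "jaguar car price", "lose weight exercise"]
corpus = [TaggedDocument(q.split(), [q]) for q in queries]
model = Doc2Vec(corpus, dm=1, vector_size=50, min_count=1, epochs=100)

vec = model.infer_vector("harry potter film".split())
for tag, score in model.dv.most_similar([vec], topn=2):
    print(tag, round(score, 3))  # most similar training queries and scores
```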

Query level feature extraction using CNN
Having evaluated the quality of the query vectors obtained in the previous part, we now proceed to the evaluation of our model architecture. We start by assessing the impact of the various parameters used to configure our CNN model and obtain the best possible results.

Impact of filter sizes
We explored the effect of various filter sizes while keeping the number of filters for each region size fixed at 100. Figure 7 shows that the plot for the filter sizes (2, 3, 4) was at the top of all the other plots throughout the run, and it yielded better accuracy (69.94%) compared to filter sizes (3, 4, 5) and (4, 5, 6) (66.39% and 67.22%, respectively), as reported in Table 5.

Impact of feature maps
Varying the number of filters per filter size does not help much, as shown in Figure 8, but there is still a noticeable accuracy result when the number of filters is 200 (68.48%), as shown in Table 6.

Impact of regularization
We used two common regularization strategies for our CNN: dropout and l2 norm constraints. We explore the effect of both here. The effect of the l2 norm imposed on the weight vectors is presented in Figure 9 and Table 7. We then experimented with varying the dropout rate from 0.1 to 0.5, as shown in Figure 10 and Table 8, fixing the l2 norm constraint to 0.01. The variations in regularization show that for the l2 norm constraint, classification performance is highest with a value of 0, which produces the best results; higher values often hurt performance, as shown in Figure 9 and Table 7. Separately, we considered the effect of the dropout rate. Figure 10 indicates that a dropout rate of 0.2 gives the best accuracy (69.73%), and that accuracy decreases as the dropout rate increases, as reported in Table 8.

Impact of batch size
We next investigated the effect of the batch size. Figure 11 and Table 9 show that varying the batch size also helps little; the best accuracy obtained was 67.01% with a batch size of 60.

Impact of optimizers
Regarding the optimizers, we can see clearly from Figure 12 that only the curves of the "adam" and "adadelta" optimizers were at the top; they produced the best accuracy results (66.81% and 65.97%) compared to "sgd" (56.16%), as shown in Table 10. Next, we exploited the observations above to configure our CNN model with the best parameter variations. We summarize the suggested parameters in Table 11. Now we can evaluate our model with the suggested values. In this respect, we assess the CNN architecture using accuracy and the mean squared error (MSE) loss, which are the most important and intuitive metrics for classification performance. They are defined as follows:

$$Accuracy = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

where $(y_i - \hat{y}_i)^2$ is the square of the difference between the actual and the predicted result. Figures 13 and 14 show respectively the loss and the accuracy obtained on both the training and validation sets. After training for 100 epochs we obtained the results shown in Table 12. These results show that an accuracy of up to 80% is the best value obtainable in this experiment. The difference between the results of the baseline model and the suggested model clearly shows the impact of the suggested values, since there is a noticeable increase in accuracy between the two models. Moreover, Figure 14 shows that the classification accuracy varies over training; after 20 epochs, we obtain the highest value. There is also a difference in accuracy between the training and validation sets.
On the other hand, by using mean squared error (MSE) as the classification loss, we optimize by reducing the average squared distance between our predictions and the true values. In Figure 13, we can see that the validation loss tracks the training loss and keeps decreasing. All these observations show that our system achieves significant results in terms of accuracy and loss, demonstrating its performance even though our data is not very large. In most cases, the size of the data matters less than the variation among the samples, and our training dataset includes a wide variety of samples.

The effectiveness of our vertical intent prediction system
Predictions on unseen data
To complete our experiments, we evaluate the performance of our model on the task of predicting vertical intents for new, unseen data. Now that we have trained our model and evaluated its accuracy on the validation set, we can use it to make predictions. This step is done using the test dataset (FedWeb'14), which has not been used for any training or validation operations, and it provides an estimate of how well the trained model performs when making predictions on unseen data.
As mentioned previously, we already generated query vectors for the test dataset using doc2vec in the first part of this experimentation. We then fed these vectors to our CNN model for prediction, obtaining a vertical intent score for each query; a sketch of this step is given below. Evaluating the quality of these predictions using mean squared error gives 1.31%, which shows that our model achieves accurate predictions.
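A minimal sketch of this final prediction step follows, assuming `model` is the trained Keras CNN and `X_test`/`y_test` hold the doc2vec vectors and the true intent scores of the test queries; these names are illustrative.

```python
# Hedged sketch: predict intent scores for unseen test queries and measure
# mean squared error against the ground truth. `model`, `X_test` and `y_test`
# are assumed to exist from the earlier steps.
import numpy as np

y_pred = model.predict(X_test)           # one intent score per vertical
mse = np.mean((y_test - y_pred) ** 2)    # average squared distance
print(f"Test MSE: {mse:.4f}")
```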

CONCLUSION AND FUTURE WORK
In this paper, we proposed a vertical intent prediction architecture to improve the vertical selection task. This approach outperforms traditional vertical selection approaches in that it aims to satisfy the user vertical intent automatically and without any feature engineering, unlike approaches that address this problem by training a machine-learned model on features supposed to detect the relevance of a vertical to the query, which is expensive and time-consuming. Our experimental findings show that our model achieves acceptable accuracy, and the research can be continued in future work to increase it. In addition, this work represents an advancement in the context of vertical selection and may renew research in this field, where the literature remains limited.