Prediction of Answer Keywords using Char-RNN

Generating sequences of characters using a Recurrent Neural Network (RNN) is a tried and tested method for creating unique and context-aware words, and is fundamental in Natural Language Processing tasks. These types of neural networks can also be used as question-answering systems. The main drawback of most of these systems is that they work from a factoid database of information, and when queried about new and current information, the responses are usually poor. In this paper, the author proposes a novel approach to finding answer keywords from a given body of news text or headline, based on the query that was asked, where the query concerns current affairs or recent news, using the Gated Recurrent Unit (GRU) variant of RNNs. This ensures that the answers provided are relevant to the content of the query that was put forth.


INTRODUCTION
In the modern world, the amount of information that is presented to the user, or even generated, is quite overwhelming. Even though various applications facilitate the consolidation of this information, some information that may prove vital in other scenarios can still be missed. One way to compress the information while retaining its context is through the use of Recurrent Neural Networks (RNNs). RNNs are dynamic systems that are able to generate sequences of information, not only in the textual domain (1), but also in various other domains, including but not limited to music (2), (3) and images (4). RNNs work on the basic principle that data is provided to the system, which trains itself on it, and after a certain number of training iterations the system begins to generate sequences of text, or of whatever other kind of information was provided initially. In theory this works well, but in practice one of the main flaws of the system is exposed: as the system generates text, it begins to lose context, so that what it subsequently generates is based only on the few previous words it has learnt, and not on the original content on which it was trained.
To overcome this fundamental flaw, the author uses a variant of the standard RNN called the GRU (Gated Recurrent Unit). Compared with a standard RNN, a Long Short-Term Memory (LSTM) based variant performs better, since it is able to retain information both from the original context that it was trained on and from the newer context that it learns, and is thus able to generate much more diverse and natural sequences of information. A GRU-based RNN is considered to be the next step in the evolution of LSTMs.
The closest implementation of this nature was attempted by the Facebook Research team (5), who use a large data corpus to predict answers to factual questions. The system proposed in this paper attempts to provide answers to questions whose answers are currently 'in the news', and may not necessarily have an encyclopaedic entry. Section 2 contains the literature review of the papers used as reference material for this paper. Section 3.1 describes the GRU-based RNN model that is used in this paper. Section 3.2 covers the proposed method, a model, and its evaluation. The results are recorded in Section 4. Section 5 comprises the future scope and conclusion.
Journal homepage: http://iaescore.com/journals/index.php/IJECE (IJECE, ISSN: 2088-8708)

RELATED WORK
2.1. Smart Reply: Automated Response Suggestion for Email
Replying to an email can be tricky, and choosing the right words to convey the right meaning is a somewhat tedious task. In the past, many automated response systems have been built into email clients, but most of them offer only static, simple sentences that cannot adapt to a given scenario. The authors of this paper devise a system that provides an automated response to a received email, based on the content of the email.
In this system, the authors create a sequence-to-sequence Long Short-Term Memory (LSTM) network to predict sequences of text. The input is an incoming message, and the output is a series of responses that are generated based on the provided text corpus. LSTMs were originally applied to machine translation but have since seen success in other domains such as image captioning and speech recognition. (6)

Improving Context Aware Language Models
In the current age, many automated textual systems use the LSTM model to generate text responses. The authors of this paper propose an alternative to the standard RNN used for text generation: a context-aware Recurrent Neural Network Language Model (RNNLM). They propose to replace the domains, which provide only an inadequate context for text generation, with context-based variables. These context variables describe certain aspects of the language, such as topic, time, or other language. The context variables are then dynamically combined to create a more coherent text. The combination of the context variables with a context embedding creates the aforementioned RNNLM.
The data used to test the model is obtained from Reddit, Twitter and SCOTUS. On the Reddit data, testing starts from a given comment, and the objective for the model is to identify the subreddit (a discussion group on one topic) from which the comment might have originated. There are eight specific subreddits and nine general subreddits, and the results show that the RNNLM model performs better on specific subreddits than on general ones, thus showing that context-based language models are more effective. (7)

Contextual LSTM (CLSTM) models for Large scale NLP tasks
Text presented in any format is usually organized into phrases, sentences, paragraphs, and sections, among others. These formats are a way of abstracting sentences into a combined entity that presents a cohesive meaning. In this paper, the authors create a Contextual LSTM (CLSTM) model and test it against a normal LSTM model on Natural Language Processing tasks such as word selection, next word prediction, and next sentence prediction. The data used for this model was from English Wikipedia and the English Google News site.
The results on the English Google News data showed improvements across the board when comparing LSTM with CLSTM. In next word prediction, the LSTM using words as features had a perplexity of 37, and the CLSTM improved on it by about 2 percent. In next sentence selection, the improvement was about 39 percent for the LSTM when compared with the CLSTM, which had an accuracy of 46 percent. Finally, in next sentence topic prediction, the LSTM using the current sentence topic as a feature had a perplexity of about 5, and the CLSTM improved on it by 9 percent. Thus one can see that using context along with an LSTM improves on a standard LSTM for natural language processing tasks. (8)

Context-aware Natural Language Generation with Recurrent Neural Networks
Natural language generation is useful in various applications, such as response generation in messaging systems, text summarization and image captioning. Most natural language generation systems work only with the provided content and, in real-life scenarios, ignore the contextual information that is present. The authors of the paper propose two approaches to address this. The first is a C2S (Contexts to Sequences) model, which encodes a set of contexts into a continuous representation and then decodes the representation into a text sequence through a recurrent neural network. However, not all words depend on the contexts; some depend only on their preceding words. To resolve this, a gating mechanism is introduced to control when the information from the contexts is accessed. This yields the second model that they propose, called gC2S (Gated Contexts to Sequences).
The data used for this is customer reviews from sites such as Amazon and TripAdvisor (a travel website where travelers can review places). For Amazon, the reviews were drawn from the book, electronics, and movie categories; for TripAdvisor, the category was hotels. Upon testing, the authors found that the gC2S model significantly outperforms the C2S model, as it adds skip-connections between the context representations and the words in the sequences, allowing the information from the contexts to directly affect the generation of words. (9)

Generating Sentences from a Continuous Space
The standard Recurrent Neural Network Language Model (RNNLM) generates a single word at a time and does not work from an explicit global sentence representation. The authors of this paper study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. The RNNLM is a probabilistic model with no significant assumptions, since it generates sentences word by word based on an evolving distributed state representation. However, because the RNNLM is broken down into a series of next-step predictions, it does not expose an interpretable representation of global features. To overcome such shortcomings, the authors use a variational autoencoder, through which they hope to capture the global features in a continuous latent variable.
A standard RNN predicts each word conditioned on the previous word and an evolving hidden state. Although effective, it does not learn a vector representation of the full sentence. In order to build a continuous latent sentence representation, one should first find a method to map between sentences and distributed representations that can be trained in an unsupervised setting. Sequence autoencoders have been somewhat successful in generating complete documents. An autoencoder consists of an encoder function and a probabilistic decoder model; in the case of a sequence autoencoder, both encoder and decoder are RNNs, and the examples are token sequences. The Variational Autoencoder (VAE) is a generative model based on a regularized version of the standard autoencoder. This model imposes a prior distribution, which enforces a regular geometry and makes it possible to draw proper samples from the model using ancestral sampling.
The dataset used for this is a collection of text from K12 e-books, which contains approximately 80 million sentences. One of the tests examined the imputation of missing words, and the authors observed that a standard RNN can only predict from a previously provided set of tokens, whereas the RNN with VAE is much more effective and can provide completions that fit the context much better and give a sentence a more cohesive meaning. (10)

A Neural Attention Model for Abstractive Sentence Summarization
Summarization is increasingly important in the current world. It is especially required by news outlets and discussion forums, among others, to condense content that is otherwise too long for users who do not have the time to read it. The objective is to create a condensed form of the text corpus that captures the whole meaning of the corpus. The most common form of summarization is extractive, where certain parts of the text are extracted and then stitched together. The other form is abstractive summarization, where the summary is generated from the bottom up. The authors of this paper propose a new approach called attention-based summarization. It incorporates less linguistic structure than other comparable approaches, and can scale easily to train on large amounts of data. Since the system makes no assumptions about the vocabulary of the generated summary, it can be trained directly on any document-summary pairs, and thus can be used to summarize a given text into a headline.

The dataset used by the authors was a collection of 500 news articles from The New York Times and Associated Press, paired with 4 different human-generated reference summaries. The authors find that the proposed model performs considerably better than the older approaches and produces more abstractive summaries. (11)

Neural Turing Machines
The neural network is an excellent concept that is able to mimic the human brain quite well. One of its missing components, however, is memory: the neural network is not able to remember the activity it has completed. It trains for a given task and executes the task, but once it has completed the task it forgets the method, and if the task is to be performed again, the network has to be trained again first. The authors of this paper propose a concept in which the neural network (in this case, an LSTM) is paired with a memory bank. This neural network is able to interact with the standard input and output vectors, and also with the memory bank.
The authors train the model to perform two simple tasks, reading and writing, and then store the concept of the task in its memory banks. They then perform a few tests, the first being the copy test. The model is provided with sequences of random 8-bit binary values of random lengths, and it has to output a copy of the initial sequence. The authors find that, after training both an LSTM model and an NTM model, the NTM model is able to perform subsequent copying tasks in much less time, and it need not be trained again. Another test that the authors performed was sorting. The model was provided with random binary values, each with a priority between -1 and 1. They found that the NTM was able to understand the concept of sorting and was able to perform the task much more cohesively. Compared to an LSTM model, it approached the task more as a human mind would. (12)
After examining the aforementioned papers, it was found that the system the authors were looking for does not exist, and thus, through this paper, the authors want to propose such a system.

PROPOSED WORK
3.1. RNN Model
In this section, the authors describe the model that is used in this paper. An example of this RNN model is shown in Figure 1. The first thing to note is the input, which is a vector x. This, along with certain weights, is passed as input to a stack of recurrently connected hidden layers, represented by A, to calculate first the hidden vector sequence h and then the output vector sequence y. This output vector y is parameterized, looped and redirected back into the network as a set of inputs. This looping feature is what gives the neural network its key recurrent nature. To observe the actual nature of the RNN, one can note that the 'unrolled' RNN is just a chain of normal neural networks, each of which processes the data as required and passes the information to its successor. This can be considered a single-layer network. To make it a 'deep' RNN, one can stack such layers one on top of the other. One of the main problems that occurs as the information traverses the layers is that the components in the later layers learn faster than those in the earlier layers. This problem is referred to as the 'Vanishing Gradient Problem' (13). Here, gradients are the values that determine how quickly the network picks up and learns information. This is an important factor in the case of RNNs, since the data is propagated back into the network, and if the values keep shrinking as they propagate through the layers, they eventually tend to zero, which makes the model useless. To overcome this, the output is repropagated directly to the input layers, and not to the middle layers.
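The shrinking of gradients described above can be illustrated with a minimal sketch. It assumes a scalar RNN of the form h_t = tanh(w_h * h_{t-1} + w_x * x_t); the function name and the weight values are hypothetical choices for illustration and are not taken from the paper's model. The gradient of the final state with respect to the initial state is a product of one factor per time step, and when each factor is smaller than one the product decays quickly.

```python
import math

def rnn_forward(xs, w_h=0.5, w_x=1.0, h0=0.0):
    """Run a scalar RNN h_t = tanh(w_h * h_{t-1} + w_x * x_t) over xs.

    Returns the hidden states and d h_T / d h_0, the gradient of the
    final state with respect to the initial state.
    """
    h = h0
    states = []
    grad = 1.0  # accumulates the chain-rule product across time steps
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
        # d h_t / d h_{t-1} = w_h * (1 - h_t^2), using the tanh derivative
        grad *= w_h * (1.0 - h * h)
    return states, grad

inputs = [0.1] * 50  # made-up constant input sequence
_, grad_short = rnn_forward(inputs[:5])
_, grad_long = rnn_forward(inputs)
print(f"gradient through 5 steps:  {grad_short:.3e}")
print(f"gradient through 50 steps: {grad_long:.3e}")
```

With these assumed weights every per-step factor is at most 0.5, so the gradient through 50 steps is many orders of magnitude smaller than the gradient through 5 steps.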

While this process is happening, another problem arises: as the system learns new information, it fails to keep track of previously learnt information. To overcome this limitation, an LSTM (14) style of RNN is used. The LSTM variant essentially eliminates the main drawback of the standard RNN, which is its inability to remember, and this variant works on a large class of problems.
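How a gated unit retains earlier information can be sketched with a single GRU step. This is a minimal illustration under stated assumptions, not the model used in this paper: the state and weights are scalars (real models use vectors and matrices), the parameter names in `params` are hypothetical, and the update convention is h' = (1 - z) * h + z * h_candidate (some references swap the roles of z and 1 - z).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU update for a scalar input x and scalar state h_prev.

    p is a dict of hypothetical scalar weights and biases.
    """
    z = sigmoid(p["w_z"] * x + p["u_z"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_r"] * x + p["u_r"] * h_prev + p["b_r"])  # reset gate
    h_cand = math.tanh(p["w_h"] * x + p["u_h"] * (r * h_prev) + p["b_h"])
    # z near 0 keeps the old state; z near 1 overwrites it with the candidate
    return (1.0 - z) * h_prev + z * h_cand

params = {
    "w_z": 1.0, "u_z": 1.0, "b_z": -20.0,  # strongly negative bias closes the update gate
    "w_r": 1.0, "u_r": 1.0, "b_r": 0.0,
    "w_h": 1.0, "u_h": 1.0, "b_h": 0.0,
}

h = 0.8
for x in [0.5, -0.3, 0.9]:
    h = gru_step(x, h, params)
print(h)
```

With the update gate held almost closed, the state passes through three new inputs nearly unchanged, which is the behaviour a plain RNN step cannot provide. In a trained network the gate values are learnt rather than fixed by hand.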

Proposed Work
Many algorithms exist to predict keywords, but most such systems are built for data which is quite large. The authors want to design a system that has a low overhead in terms of model building and the amount of data required to predict. The model the authors have chosen is a GRU variant of the Recurrent Neural Network. The authors split the overall process into three subprocesses and then expand on each of them individually, as shown in Figure 3.

Data Collecting and Preprocessing
Building the model
Evaluating the model

Sub Process 1: Data Collecting and Pre-processing
For this, a self-made dataset is used. It is a collection of questions stored in a text file, one per line. An example of the dataset is shown in Figure 4. First, each question is searched on a normal internet search engine, and the results obtained are then processed. All of the content is first converted to lowercase, and then the common stop words are removed. For this task, one uses SpaCy (15) and NLTK (16), (17), which are natural language processing tools, to clear out the stop words and the punctuation marks as well (while punctuation marks are useful for a human reader, they only add noise to the model). Then, depending on the type of the question, the user can also filter out numeric values that may be present (numeric values are useful only if the answer keywords being predicted contain them; otherwise they add noise to the model). Once this pre-processing has been repeated for all the fetched data, the result is stored in a file, which becomes the input data to the model. The process is shown in Figure 5.
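The cleaning steps above can be sketched in plain Python. This is only an approximation under stated assumptions: the paper uses SpaCy and NLTK, whereas this sketch substitutes a tiny hand-written stop-word set (`STOP_WORDS`) and a hypothetical `preprocess` function so that it runs without any third-party library.

```python
import string

# Hypothetical, deliberately tiny stop-word list; the paper relies on SpaCy and NLTK instead.
STOP_WORDS = {"a", "an", "the", "is", "was", "for", "of", "in", "to", "and", "who", "has"}

def preprocess(text, drop_numbers=False):
    """Lowercase, strip punctuation, remove stop words, optionally drop numeric tokens."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    if drop_numbers:
        # Only purely numeric tokens are removed here
        tokens = [t for t in tokens if not t.isdigit()]
    return " ".join(tokens)

question = "Who won the 2017 Nobel prize for economics?"
print(preprocess(question))
print(preprocess(question, drop_numbers=True))
```

The `drop_numbers` flag mirrors the optional numeric filter: it is useful when the expected answer keywords contain no numbers, and harmful when they do.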

Sub Process 2: Building the model
Here, the input data is split into chunks of a certain size. These chunks form the input vectors for the model. Each chunk forms a single input vector, and each value from the vector is provided as an input to the first layer of the RNN. Once these values are "encoded", the user can start to train the model on them. After they are processed by the first layer, the output is captured. This becomes the first state. This state is propagated back into the model as input. As the model goes over a new encoded vector each time, it remembers what it has already trained on as well as the newer information that it has learnt.
Suppose the first input contains an error and the user is not allowed to modify it; the magnitude of the error then multiplies as the model builds, which causes problems. With an RNN, however, one can tweak the data that is propagated from the subsequent layers, and thus diminish the error rate in the system. This is because one can control the amount of information that is propagated back into the system. So, if the first input contains an error, one can take the output after the first iteration, tweak that output, and then provide it as the input for the second iteration. Using the dataset, one can thus train the whole model, and after this one can start evaluating the model.
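The chunking and encoding step can be sketched as follows for a character-level model. The function names, the integer encoding, and the chunk size of 5 are assumptions made for illustration; the paper does not state the chunk size or the encoding it uses.

```python
def build_vocab(text):
    """Map each distinct character to an integer index."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}

def encode_chunks(text, chunk_size):
    """Split text into fixed-size chunks and encode each as a list of indices.

    A trailing remainder shorter than chunk_size is dropped in this sketch.
    """
    vocab = build_vocab(text)
    encoded = [vocab[ch] for ch in text]
    chunks = [encoded[i:i + chunk_size]
              for i in range(0, len(encoded) - chunk_size + 1, chunk_size)]
    return vocab, chunks

vocab, chunks = encode_chunks("nobel prize economics", chunk_size=5)
print(len(vocab), len(chunks), chunks[0])
```

Each encoded chunk then plays the role of one input vector, and the model's state is carried from one chunk to the next during training.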

Sub Process 3: Evaluating the model
There are two possible ways to evaluate the model. The first is the manual method, wherein the authors verify the predictions by noting the answer keywords and then manually searching for the query and checking the results. This method is a bit tedious. The other method is to have a program that automates this, which makes evaluation much simpler. The automated process is similar to the manual one. The authors provide a set of questions at a time, five questions, for example. They perform the operation for each of the questions and capture the output. Then they tokenize the captured output and compare it with a set of answer keywords that the authors themselves have provided.
If the provided answer keywords are found among the tokenized words, then the model is said to have predicted the answer correctly; otherwise the model has predicted inaccurately.
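The automated check can be sketched as below. The function names are hypothetical, whitespace tokenization stands in for whatever tokenizer the authors used, and the sketch assumes that every provided keyword must be present for a prediction to count as correct (the text does not say whether all or any keywords are required). The sample outputs are made-up placeholders, not results from the paper.

```python
def is_correct(generated_text, answer_keywords):
    """True if every expected keyword appears among the generated tokens."""
    tokens = set(generated_text.lower().split())
    return all(k.lower() in tokens for k in answer_keywords)

def accuracy(outputs, expected):
    """Fraction of questions whose generated output contains its answer keywords."""
    hits = sum(is_correct(out, keys) for out, keys in zip(outputs, expected))
    return hits / len(expected)

# Made-up outputs for illustration only; they are not results from the paper.
outputs = ["alpha beta gamma", "delta epsilon", "zeta eta"]
expected = [["beta"], ["omega"], ["zeta", "eta"]]
print(accuracy(outputs, expected))
```

Averaging this score over repeated runs of the same question set gives an accuracy figure of the kind reported in the results.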

RESULTS
As mentioned in section 3.5., for the automated process, a set of questions was obtained from the user and provided as the testing dataset for the model. After each complete run of the model, the output was captured and compared with the answers that were provided by the users themselves.
One of the ways in which the system was tested was by manually providing questions. Here the authors consider two questions: the first is "Who won the 2017 Nobel prize for economics?" and the second is "Which company has launched the Indian Pale Ale (IPA), its fifth beer in India".
For the first question, the obtained results are shown in Figure 6. After these are provided as the dataset for the model to train on, the system is able to predict certain sequences of words that match the results obtained from an internet search query. This is shown in the following figures: the answer keyword patterns predicted by the system are shown in Figure 7, and the answers obtained from an internet search are shown in Figure 8.
The results of the second question were verified in a similar manner. The results dataset is shown in Figure 9. This is provided as input for the system to train on, and once trained, the system begins to generate answer keyword patterns, as in the previous example; these are shown in Figure 10. The observation to note here is that, when the question is queried via internet search engines, similar results are yielded, as shown in Figure 11.
The observed training loss for each of the questions is shown in the following figures. The graph of the training loss for the first question is shown in Figure 12, and that for the second question in Figure 13. One general observation in both cases is that with each iteration the amount of text processed increases and the training loss decreases, indicating that the model is getting more effective at predicting keywords.
In a similar manner, a total set of twenty-five questions was obtained from the user. For each run, a specific number of questions was chosen (for the first run, one question; for the second run, five questions; and so on). Each run of the whole operation was performed five times per set of questions. The accuracy score was calculated as the average number of times the right answer was predicted. Upon examination it can be noted that, as the number of questions increases, the accuracy falls, but it eventually levels off at 74%. Over multiple runs, the accuracy score was found to average out to this same level.
The area where the accuracy of the model falls is answers that contain an embedded numerical value, which the model is not able to predict accurately. Examples of this type of answer keyword are "350m" or "12ft". For the model to predict these types of answer keywords, they need to occur often enough to form an observable pattern for the model to pick up on, and the probability of such keywords occurring that often is quite low. But if they are split (such as "350" and "m"), then the numbers and the attached string parts individually have no meaning, and only add to the noise when training the model. Hence, the accuracy falls for these types of questions.
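The two options described above, splitting a numeric-embedded keyword or filtering it out, can be shown with a short sketch. The helper names and the surrounding tokens "tower" and "tall" are made up for illustration; only "350m" comes from the text.

```python
import re

def split_alnum(token):
    """Split a token such as '350m' into its digit runs and letter runs."""
    return re.findall(r"\d+|[A-Za-z]+", token)

def drop_numeric(tokens):
    """Remove tokens that contain any digit, as a strict numeric filter would."""
    return [t for t in tokens if not any(ch.isdigit() for ch in t)]

tokens = ["tower", "350m", "tall"]  # made-up context around the "350m" example
print([split_alnum(t) for t in tokens])
print(drop_numeric(tokens))
```

Splitting leaves the fragments "350" and "m", which carry little meaning on their own, while filtering removes the keyword altogether, so in either case the answer is no longer available to the model as a single pattern.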
The other observation is that, as the number of questions increases, the accuracy falls. This is because certain questions have answers with embedded numerical values, and cleaning or filtering these out loses vital information that the model needs, which leads to the dip in the accuracy score.

CONCLUSION AND FUTURE SCOPE
In terms of future scope, one of the main areas where the accuracy of the model can be improved is the data that is provided to it and the training that is done on it. As mentioned, the fall in the accuracy score occurs when the answer keyword contains an embedded numerical value. Thus, a possible future improvement for this research would be to allow the model to train on data with embedded numerical values, so that the accuracy score of the model could be improved.
On a concluding note, the authors would like to say that it was exciting to delve into this research, and that it was an interesting concept to explore. While performing this research, the authors learnt about various tools and concepts that are used in research and in academic spheres. They were exposed to, and learnt, various concepts of Machine Learning and Natural Language Processing.