Enhanced TextRank using weighted word embedding for text summarization

ABSTRACT

This work proposes enhancing the TextRank algorithm for extractive text summarization using weighted word embedding. Word embeddings from Word2Vec, FastText, and the contextualized Indonesian BERT model (IndoBERT), weighted with TF-IDF scores, are used to build the sentence representations on which TextRank operates. Experiments on the Indonesian Liputan6 news dataset show that word embedding alone improves the original TextRank by up to 13.25% in ROUGE-1 and 22.02% in ROUGE-2, with BERT embedding contributing the largest gain. Applying TF-IDF weighting to the word embeddings yields further significant improvements, particularly for the static Word2Vec and FastText embeddings. Overall, the proposed methods outperform the original TextRank by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2.
INTRODUCTION
According to Datareportal statistics from 2022 [1], Indonesia is a country where 204 million people (73% of the population) use the internet. Every day, Indonesians spend an average of 20 percent of their internet time reading the news. Their interest in reading the news is also supported by the growing number of Indonesian online news portals, estimated to reach 43,300 media outlets according to the 2017 Press Council statistics presented in [2]. Even though the public's interest in reading the news is considerable, reading long articles takes time. This may discourage people from reading through the news, especially because people in the digital era expect to find information from various sources quickly [3]. As a result of this tendency, we often see the term "too long; didn't read" ("tl;dr") attached to an outline or summary of writings spread on the internet. The term is meant to let other people quickly grasp the main point of a text. The very presence of "tl;dr" also shows that the length of a text affects one's interest in reading it.
It is beneficial if a lengthy text can be condensed into a more compact form, i.e., a summary, saving readers the time and effort of finding important information in the text [4]. According to Radev et al. [5], a summary is a text produced from one or more texts that conveys important information from the original text(s) and is no longer than half of the original. The rest of this paper is organized as follows: Section 2 describes the dataset, system framework, and evaluation method. Section 3 describes our experimental results together with the analysis. Finally, Section 4 concludes our study.

METHOD

Dataset
This work uses the Liputan6 dataset [9], a large-scale Indonesian dataset for text summarization. The documents in this dataset come from news articles on the Indonesian online news portal Liputan6, covering a variety of topics such as politics, business, and technology. The dataset was built automatically by Koto et al. [9] by extracting the short description contained in the webpage metadata of each news article on the Liputan6 website as the ground truth summary for that article. We use the Canonical set of the Liputan6 dataset, which is the complete version of the dataset, to obtain more comprehensive experimental results. Since the main summarization method used in this work, the TextRank algorithm, is unsupervised, we only use the test split of the dataset, which consists of 10,972 documents. The statistics of this dataset are presented in Table 1. It appears from Table 1 that the documents in our dataset are short to moderate in length, at around 12 sentences per document. The ground truth summaries are also relatively short, with 2 sentences on average, ranging from 1 to 5 sentences. For evaluation purposes, the length of each summary generated by our summarization system follows the length of the ground truth summary for the corresponding document. For example, if the ground truth summary for a document has 3 sentences, then our system also produces a 3-sentence summary for that document.
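For illustration, these statistics can be recomputed with a short script. The following is a minimal sketch that assumes one JSON file per article with the "clean_article" and "clean_summary" fields of the public Liputan6 release; the directory path is hypothetical and should be adjusted to a local copy of the data.

```python
import json
from pathlib import Path

# Hypothetical path to the canonical test split; field names follow the
# public Liputan6 release by Koto et al. Adjust both to your local copy.
TEST_DIR = Path("liputan6_data/canonical/test")

doc_lens, summary_lens = [], []
for fname in TEST_DIR.glob("*.json"):
    with open(fname, encoding="utf-8") as f:
        item = json.load(f)
    # Each document is a list of sentences; each sentence is a list of tokens.
    doc_lens.append(len(item["clean_article"]))
    summary_lens.append(len(item["clean_summary"]))

print(f"documents: {len(doc_lens)}")
print(f"avg sentences/document: {sum(doc_lens) / len(doc_lens):.1f}")
print(f"avg sentences/summary:  {sum(summary_lens) / len(summary_lens):.1f}")
```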
One of the conveniences provided by the Liputan6 dataset is that each news document has already been tokenized into sentences, and each sentence has already been tokenized into words. This makes it easier for us to preprocess the data; the only preprocessing we perform on the dataset is case folding. Figure 1 illustrates an example of a document in the Liputan6 dataset as used in this work (see the right column of the table). We can see that documents in the Liputan6 dataset take the form of a list of sentences, where each sentence is a list of words, and that document titles are not included. The original document is presented in the left column of the table. Note that the English translation is not part of the dataset; we display it in the table only to help readers understand the content of the document. This example document consists of nine sentences, and its ground truth summary consists of two sentences.

System framework
Figure 2 illustrates the flow of the process in our summarization system. Initially, embedding models are learned from a large corpus. We train the Word2Vec and FastText models on the Indonesian Wikipedia dataset; for the BERT model, we use Indonesian BERT-based pre-trained models from previous work, i.e., IndoBERT [32]. When a document to be summarized is input into our system, it is first split into sentences ($s$). Then, a word embedding for each word in the sentences is generated from the embedding models. The red box in the figure highlights a slightly different process for generating word embeddings with Word2Vec/FastText versus BERT. The dashed blue arrows inside the red box indicate conditional flow: to generate word embeddings with Word2Vec/FastText, we follow the top dashed arrow; otherwise, to generate BERT word embeddings, we follow the bottom dashed arrow.
Because Word2Vec and FastText are traditional word embeddings, there is only one static vector representation for each word. Therefore, each sentence first needs to be split into words. Then, given a word ($w$), the Word2Vec/FastText model generates a word embedding or word vector ($v_w$). This is different from BERT, a contextualized word embedding model that considers context from the word sequences to the left and right when computing a word embedding. As a result, the model gives different embeddings for a word depending on its context, so the word embedding for any word is not static. Since the model needs context, we directly input a sentence into the model to produce the word embedding for each word in the sentence.
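The following minimal sketch illustrates the static property of traditional embeddings; the model path is hypothetical, and the contextual alternative is sketched later in the BERT subsection.

```python
from gensim.models import KeyedVectors

# Hypothetical path to the vectors of a trained Word2Vec/FastText model.
wv = KeyedVectors.load("id_word2vec.kv")

# A static embedding: the vector for "makan" ("to eat") is identical
# no matter which sentence the word appears in.
v1 = wv["makan"]  # as in "saya makan nasi"
v2 = wv["makan"]  # as in "kucing itu makan ikan"
assert (v1 == v2).all()

# A contextual model such as BERT instead takes the whole sentence as
# input and generally returns a different vector for each occurrence.
```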
Further, TF-IDF weighting is computed over our Liputan6 dataset to assign each word a weight based on its occurrence statistics in the dataset. The TF-IDF scores are used to weigh the word embeddings generated earlier, producing the weighted word embeddings or weighted word vectors ($\hat{v}_w$). Next, sentence vectors ($v_s$) are formed by averaging the weighted word vectors of the words composing each sentence. Once the sentence vectors are obtained, the cosine similarity between every pair of sentences in the document is computed to build a sentence similarity matrix. The TextRank summarization algorithm is then applied to score the sentences, using a graph-based method, according to their importance relative to all other sentences in the document. To generate a summary, the top-k sentences with the highest scores are extracted and then arranged according to the order of the sentences in the document.
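As an illustration of this step, the sketch below builds sentence vectors by averaging TF-IDF-weighted word vectors and then computes the pairwise cosine similarity matrix. The helper names (word_vec, tfidf_weight) are hypothetical stand-ins for the embedding lookup and the TF-IDF scores described above.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def sentence_vectors(sentences, word_vec, tfidf_weight, dim=100):
    """Average TF-IDF-weighted word vectors to get one vector per sentence.

    sentences:    list of sentences, each a list of tokens
    word_vec:     token -> embedding lookup (e.g., Gensim KeyedVectors)
    tfidf_weight: token -> TF-IDF weight for the current document
    """
    vectors = []
    for sent in sentences:
        weighted = [tfidf_weight.get(w, 0.0) * word_vec[w]
                    for w in sent if w in word_vec]
        # Fall back to a zero vector if no word of the sentence is known.
        vectors.append(np.mean(weighted, axis=0) if weighted else np.zeros(dim))
    return np.vstack(vectors)

# Pairwise cosine similarities form the sentence similarity matrix that
# the TextRank step (sketched later) uses as edge weights:
# S = sentence_vectors(doc_sentences, wv, doc_tfidf)
# sim_matrix = cosine_similarity(S)
```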
In general, our summarization system contains three main components: word embedding, TF-IDF weighting, and the TextRank summarization algorithm. These components are explained in more detail in the following subsections.

Word embedding
The word embedding concept was initially introduced by Bengio et al. [34] as a distributed representation of words, obtained by jointly training word embeddings with a neural network model's parameters, that can preserve the semantic and syntactic relationships between words. A word embedding represents a word as a vector learned from a large collection of data, in which words with similar meanings (semantically similar words) have vectors that are close together in the vector space. Word embedding is generally more effective than the classic bag-of-words vector representation because of its ability to capture semantic relationships between words. In addition, since word embeddings are dense representations, they also avoid the sparsity problem of the classic representation, which makes computation more effective and efficient. For these reasons, most current research in text processing uses word embeddings as the word representation [35]-[38]. Three variants of word embedding are explored in this work: Word2Vec, FastText, and BERT. They are detailed in the following.

a) Word2Vec
Mikolov et al. [29] introduced two neural architectures to learn Word2Vec word embeddings: the continuous skip-gram model and the continuous bag-of-words (CBOW) model. Both models learn vector representations of words that capture semantic information as well as linguistic regularities and patterns. CBOW works on the task of predicting the current word ($w_t$) from its context words ($w_{t \pm i}$), where $i = 1, 2, 3, \ldots, n$, with $n$ denoting the window size. Conversely, skip-gram works on the task of predicting the context words ($w_{t \pm i}$) from the current word ($w_t$). The window size, one of the hyperparameters of the Word2Vec algorithm that can be set during training, indicates the maximum distance between the current and predicted words within a sentence. The weights learned in the neural network during training serve as the elements of the word embedding. Mikolov et al. [29] found that the skip-gram model captures semantic relationships better than the CBOW model; therefore, the skip-gram model is chosen in this work to build the Word2Vec model.
We use the Python library Gensim to implement the Word2Vec algorithm and train the embedding model on the Indonesian Wikipedia dataset. We keep Gensim's default hyperparameter values to build the Word2Vec model, as follows: learning rate = 0.025, minimum word count = 5, and window size = 5. We train the Word2Vec model on the Indonesian Wikipedia dataset because it is a large, publicly available dataset; for the same reason, many previous researchers have also trained their Word2Vec models on Wikipedia data [39], [40].
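A minimal Gensim training sketch with these settings is shown below, assuming a preprocessed Indonesian Wikipedia dump with one tokenized sentence per line; the file and output names are hypothetical.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Hypothetical file: one tokenized Indonesian Wikipedia sentence per line.
corpus = LineSentence("idwiki_sentences.txt")

model = Word2Vec(
    sentences=corpus,
    sg=1,         # skip-gram architecture, as chosen in this work
    alpha=0.025,  # Gensim default initial learning rate
    min_count=5,  # Gensim default minimum word count
    window=5,     # Gensim default window size
)
model.save("id_word2vec.model")
```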

b) FastText
FastText is an extension of Word2Vec embedding proposed by Bojanowski et al. [30] that takes the morphology of words (i.e., subwords) into account. In the original Word2Vec embedding, every word has an embedding generated without considering similarity in word morphology, so it is possible for two morphologically similar words to have dissimilar representations. This is considered a drawback of Word2Vec for morphologically rich languages [30]. FastText operates at a more granular level, the character n-gram level, while Word2Vec works at the word level. A character n-gram is a sequence of n characters within a given character window. In FastText, a word is represented by the sum of its character n-grams; that is, the word embedding is obtained by summing the vector representations of its character n-grams. For example, the Indonesian word "melihat" (English translation: "to see"), with a window size of 3, is split into seven subwords (character 3-grams): "<me", "mel", "eli", "lih", "iha", "hat", "at>", and the word embedding for "melihat" is the sum of the vectors of these seven subwords.
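The subword decomposition above can be reproduced with a few lines of Python. This sketch mirrors FastText's "<" and ">" boundary markers for a single n, leaving out details of the full model (which ranges n between a minimum and a maximum value).

```python
def char_ngrams(word, n=3):
    """Character n-grams with FastText's boundary markers '<' and '>'."""
    marked = f"<{word}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

print(char_ngrams("melihat"))
# ['<me', 'mel', 'eli', 'lih', 'iha', 'hat', 'at>']
```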
Using the above mechanism, FastText can better learn representations for morphologically rich languages. Words with similar morphology (sharing many overlapping character n-grams) are closer together in the vector space. In addition, rare words can be represented appropriately by summing the embeddings of their subwords. As a result, we can infer embeddings for unseen words that do not exist in the training corpus, which helps to tackle the out-of-vocabulary (OOV) issue. As with our Word2Vec implementation, we use the Python library Gensim and the skip-gram architecture to implement the FastText algorithm and train the embedding model on the Indonesian Wikipedia dataset.
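A minimal Gensim training sketch under the same assumptions as the Word2Vec example (hypothetical corpus file, skip-gram architecture) might look as follows.

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Hypothetical preprocessed Indonesian Wikipedia dump, as before.
corpus = LineSentence("idwiki_sentences.txt")

model = FastText(
    sentences=corpus,
    sg=1,     # skip-gram, matching our Word2Vec setup
    min_n=3,  # smallest character n-gram (Gensim default)
    max_n=6,  # largest character n-gram (Gensim default)
)

# Subword information lets the model build vectors for unseen words:
vec = model.wv["keterlihatan"]  # works even if the word is out of vocabulary
```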
c) Bidirectional encoder representations from transformers
BERT is a contextualized language model first introduced by Devlin et al. [31]. The BERT architecture consists of a stack of encoders from the transformer model [41] that learns the word context of a language bidirectionally. While Word2Vec and FastText are traditional word embeddings, in which a word always has the same embedding regardless of the words before or after it, BERT produces contextualized word embeddings assigned by looking at the context around each word. It is therefore designed to better capture the contextual relationships between words. A BERT model can be fine-tuned to directly solve downstream tasks; in this work, however, BERT is used only in a feature-based approach, to extract word embeddings to be incorporated into the TextRank algorithm. Its performance is then compared against the Word2Vec and FastText models.
Recently, BERT was pre-trained on a large-scale Indonesian corpus of around 4 billion words, resulting in IndoBERT [32]. IndoBERT has several variants that differ in model architecture, such as the number of encoder layers, the hidden layer size, and the number of attention heads. The two main variants investigated in this work are IndoBERTbase and IndoBERTlarge. Table 2 highlights the architectural differences between them: IndoBERTlarge has twice as many encoder layers and almost three times as many neural network parameters as IndoBERTbase. To examine to what extent this difference affects the effectiveness of the resulting word embeddings, we compare the results of our summarization system using contextualized embeddings from these two architectures. To use the pre-trained IndoBERT models, we use the transformers library from Hugging Face, which provides application programming interfaces (APIs) and tools for downloading and training state-of-the-art pre-trained language models.
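As a sketch of the feature-based extraction described above, the following code loads an IndoBERT checkpoint with the transformers library and produces one contextual vector per token; the model identifier and example sentence are assumptions for illustration, not prescriptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; substitute whichever IndoBERT variant
# (base or large) is being evaluated.
name = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentence = "pemerintah menurunkan harga bahan bakar minyak"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per (sub)word token from the last encoder layer;
# sub-token vectors can be pooled back into word-level embeddings.
token_embeddings = outputs.last_hidden_state.squeeze(0)
print(token_embeddings.shape)  # (num_tokens, hidden_size)
```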

TF-IDF weighting
TF-IDF gives weight to the words in a text based on their occurrence in the collection. In this work, the TF-IDF score is used to weigh the word embeddings (word vectors). In general, TF-IDF estimates how important a word is in a document while also considering its importance in the collection. The TF and IDF scores are computed as in (1) and (2):

$\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{\hat{t} \in d} f_{\hat{t},d}}$  (1)

$\mathrm{IDF}(t, D) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$  (2)

where $f_{t,d}$ represents the term frequency (TF) of a word $t$ in a document $d$, and $\sum_{\hat{t} \in d} f_{\hat{t},d}$ indicates the document length, i.e., the total number of terms in document $d$. The higher the number of occurrences of a term in a document, the more important that word is in the document. $\mathrm{IDF}(t, D)$ represents the inverse document frequency (IDF) of a word $t$ in the collection $D$, where $|D|$ is the number of documents in the collection and $|\{d \in D : t \in d\}|$ is the number of documents in the collection that contain the word $t$. The more documents in a collection contain a particular word, the more general that word is. To sum up, words that appear often in a document but not often in the corpus are assigned a high TF-IDF weight, indicating that they are important words.
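A direct implementation of (1) and (2) might look as follows; it assumes every queried term occurs in at least one document of the collection, which holds when terms are drawn from the collection itself.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Equation (1): frequency of `term` normalized by document length."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def idf(term, collection):
    """Equation (2): log of collection size over documents containing `term`.

    `collection` is a list of documents, each a set or list of tokens.
    Assumes `term` occurs in at least one document.
    """
    n_containing = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / n_containing)

# The TF-IDF weight used to scale a word's embedding vector:
# weight = tf(term, doc_tokens) * idf(term, collection)
```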

TextRank
TextRank is a graph-based ranking algorithm for scoring text [8], where the text unit can be specified, e.g., keywords or sentences. TextRank is an extension of PageRank [42], which was originally aimed at webpage ranking. In this paper, TextRank is used for sentence ranking. The relationship between sentences in the document is the factor that determines the score TextRank gives to each sentence. To calculate the relationship between sentences, we use cosine similarity, following Barrios et al. [24]. The resulting pairwise cosine similarity scores between sentences are used to construct a similarity matrix that serves as the basis for a weighted graph of sentences.
In the weighted graph representation, a vertex represents a sentence, and an edge between two vertices represents the relationship between two sentences, with the similarity score used as the weight of the edge. Two sentences that are more similar have a higher weight on the edge connecting them. Based on this graph representation, we apply the TextRank algorithm to score each sentence in the document using (3):

$WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$  (3)

where $V_i$ is the $i$-th vertex and $WS(V_i)$ denotes the score of the $i$-th vertex, which represents the sentence score. $In(V_i)$ and $Out(V_j)$ denote the set of vertices that point to $V_i$ and the set of vertices that $V_j$ points to, respectively. $d$ is a damping factor, the probability of randomly moving from one vertex to another in the graph. $w_{ji}$ and $w_{jk}$ denote the weight of the edge between sentences $j$ and $i$, and the weight of the edge between sentences $j$ and $k$, respectively.
TextRank is an iterative algorithm that stops when convergence is achieved, i.e., when the difference between a sentence's scores in two successive iterations is so small that we can assume the scores will no longer change significantly in subsequent iterations. When convergence is achieved, all sentences in the document have final scores, and we take the N sentences with the highest scores as the summary. Since the length of the summaries in our ground truth dataset varies, N is adjusted to the actual length of the ground truth summary for each document. In our implementation, we use the Python library sklearn to compute the cosine similarity between sentences, and the Python library NetworkX to implement the TextRank algorithm. The damping factor is set to 0.85, following the original implementations of PageRank [42] and TextRank [8]. For the original TextRank used as a baseline in our experiments, words are represented with bag-of-words vectors as in the original paper [8]; for our summarization methods, we use word embeddings from Word2Vec, FastText, and IndoBERT as word representations.
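The graph construction and scoring described above can be sketched with NetworkX as follows; the function name is hypothetical, and zeroing the diagonal (a sentence's similarity to itself) is an implementation choice rather than part of the original algorithm.

```python
import networkx as nx
import numpy as np

def textrank_summary(sentences, sim_matrix, n):
    """Score sentences with PageRank over the similarity graph and
    return the top-n sentences in their original document order."""
    m = np.asarray(sim_matrix, dtype=float).copy()
    np.fill_diagonal(m, 0.0)  # ignore each sentence's similarity to itself

    graph = nx.from_numpy_array(m)          # weighted, undirected graph
    scores = nx.pagerank(graph, alpha=0.85)  # damping factor d = 0.85

    top = sorted(scores, key=scores.get, reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # restore document order
```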

Evaluation method
The evaluation of our experimental results is performed using the recall-oriented understudy for gisting evaluation (ROUGE) metric [43], a common metric for extractive text summarization. In general, ROUGE calculates the word overlap between automatic summaries and ground truth summaries. The types of ROUGE used are ROUGE-N and ROUGE-L: ROUGE-N counts the number of n-grams overlapping between two summaries, while ROUGE-L counts the longest common subsequence (LCS) between two summaries. We choose the F1 score as the measurement for our ROUGE calculations since it considers both precision and recall. The Python library rouge_score is used to calculate the ROUGE scores of our summaries.
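For illustration, a minimal rouge_score usage sketch is given below; the example texts are invented, and ROUGE-L is computed alongside ROUGE-1 and ROUGE-2.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])

reference = "harga bahan bakar minyak turun mulai hari ini"
candidate = "pemerintah menurunkan harga bahan bakar minyak hari ini"

# score(target, prediction) returns precision/recall/F1 per ROUGE type.
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(name, f"F1 = {score.fmeasure:.3f}")
```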

RESULTS AND DISCUSSION

The effectiveness of enhanced TextRank using (unweighted) word embedding
Table 3 describes the performance of our TextRank-based summarization systems using (unweighted) word embeddings from Word2Vec, FastText, and IndoBERT, compared to the original TextRank method as the baseline. The ROUGE scores of all systems using word embedding are higher than those of the baseline system; the use of word embedding is shown to significantly increase the performance of the original TextRank algorithm. It also appears from the table that the systems using IndoBERT are significantly more effective than those using Word2Vec and FastText. This confirms one of the contributions of this work, namely using the contextualized word embedding BERT in the TextRank algorithm. BERT's advantage arises because its word embeddings are generated by looking at the context of the surrounding words, unlike the Word2Vec and FastText models, which adopt static word representations that do not consider context. As a result, BERT's word embeddings capture semantic relationships between words more accurately. A better word representation results in a better sentence representation, which in turn contributes to a better estimate of sentence importance by the TextRank method. This explains why BERT embedding enhances the performance of the TextRank method the most.
TextRank using IndoBERTlarge gains slightly higher ROUGE scores than TextRank using IndoBERTbase, consistent with the findings reported in previous work [44]. The statistical test, however, shows that this difference is not significant. The slight superiority of IndoBERTlarge over IndoBERTbase comes from the higher number of layers and parameters in its neural model, which makes it slightly better at capturing the meanings of words.

The effectiveness of enhanced TextRank using weighted word embedding
Table 4 shows the performance of the enhanced TextRank methods using weighted word embedding. Compared to the results using unweighted word embedding presented earlier in Table 3, it is clear that applying TF-IDF weighting to the word embedding significantly increases the performance of all summarization systems. This increase occurs because, with weighted word embedding, the importance of words in the collection is taken into account when building the sentence representation. Consequently, the generated sentence vectors improve, which in turn enhances the ability of the TextRank method to estimate the importance of sentences. In Table 4, the symbol ‡ indicates a significant difference against the corresponding method without TF-IDF weighting, as displayed in Table 3. The symbols *, +, ×, and ÷ denote significant differences against TextRank, TextRank + Word2Vec + TF-IDF, TextRank + FastText + TF-IDF, and the TextRank + IndoBERT-based methods + TF-IDF, respectively, according to a paired t-test (p < 0.05).
The performance increases over the systems without weighted word embedding range from 2.64% to 7.04% in ROUGE-1, 4.48% to 12.92% in ROUGE-2, and 2.78% to 7.81% in ROUGE-L. The systems that benefit the most from TF-IDF weighting are those using the Word2Vec and FastText models, which achieve 12.92% and 12.78% increases in ROUGE-2, respectively. The systems using IndoBERTbase and IndoBERTlarge gain only 4.83% and 4.48% increases in ROUGE-2, respectively. The performance increases obtained by the systems using Word2Vec and FastText are thus almost three times those of the systems using the BERT models. We believe this happens because TF-IDF weighting helps to compensate for the limitation of the static word embeddings generated by Word2Vec and FastText, which do not see the contextual meaning of a word; the TF-IDF weight adds information about the importance of words in the collection to the word representation. In addition, TF-IDF behaves much like the word embeddings from Word2Vec and FastText, in that each word has only one TF-IDF weight regardless of the words occurring before or after it, as it does not consider context. Consequently, TF-IDF weighting is more effective for Word2Vec and FastText than for BERT.

More about the summary results
We analyze some examples of the summaries generated by our systems and compare them with the ground truth summaries. Table 5 shows the summaries generated for the document with id "16,379" in our dataset using TextRank with (unweighted/weighted) word embeddings from Word2Vec, FastText, and BERT. The ground truth summary and the baseline (TextRank) summary for this document are also displayed in the table for comparison. Text printed in boldface indicates text that exists in the ground truth summary; the more text in an automatically generated summary is bolded, the better the summary, since it is more similar to the ground truth summary.
It appears that the summary generated by the baseline method contains only one sentence (out of two) from the ground truth summary. The results of the TextRank+Word2Vec and TextRank+FastText systems are the same. In this case, the use of unweighted word embedding from Word2Vec and FastText is unable to improve on the baseline TextRank method.
However, after adding TF-IDF weighting to the word embeddings from Word2Vec and FastText, the TextRank method is enhanced, and the summaries generated by these systems become the same as the ground truth summary. This example demonstrates that weighted word embedding is effective for Word2Vec and FastText, as it improves on the performance of unweighted word embedding. Meanwhile, using the IndoBERTbase and IndoBERTlarge word embeddings, even without TF-IDF weighting, already produces summaries that are the same as the ground truth summary, and further using the weighted word embedding does not change the summary results for the IndoBERT-based systems. This confirms our results in the earlier subsection: since IndoBERT considers context when producing word embeddings, the original word embeddings can already capture the semantics of words well. Therefore, the effect of TF-IDF weighting on this method is smaller than on the static word embeddings (Word2Vec and FastText).

CONCLUSION
We propose using weighted word embedding to enhance the TextRank algorithm for text summarization on an Indonesian news dataset. A variety of word embeddings, namely the traditional word embeddings Word2Vec and FastText and the contextualized word embedding BERT, combined with TF-IDF weighting, are incorporated into TextRank in our methods. The effect of weighted word embedding (as well as unweighted word embedding) on the performance of the TextRank summarization method is examined in this work.
Our experimental results show that the use of (unweighted) word embedding improves the performance of the TextRank method by up to 13.25% in ROUGE-1 and up to 22.02% in ROUGE-2. The system using word embedding from BERT achieves the highest performance increase: the BERT-based summarization systems gain 5.98% and 7.94% higher ROUGE-2 scores than the systems using Word2Vec and FastText, respectively. This indicates that BERT word embedding, which is generated by considering the context of the surrounding words, produces a better word representation than Word2Vec and FastText, which further results in a better estimation of sentence importance in TextRank. When each word embedding is weighted using its TF-IDF score, the performance of all systems improves significantly over unweighted word embedding. However, the BERT-based systems gain only a small improvement (increases of 2.64% to 4.83%), while the biggest improvements are achieved by the systems using static word embeddings, i.e., Word2Vec (increases of 6.80% to 12.92%) and FastText (increases of 7.04% to 12.78%). Overall, our systems using weighted word embedding outperform the original TextRank method by up to 17.33% in ROUGE-1 and 30.01% in ROUGE-2, showing that our proposed methods give a significant improvement over the original TextRank method.