Myanmar news summarization using different word representations

Received Feb 25, 2020 Revised Jul 29, 2020 Accepted Aug 9, 2020

There is an enormous amount of information available in different forms of sources and genres. In order to extract useful information from a massive amount of data, an automatic mechanism is required. Text summarization systems assist with content reduction by keeping the important information and filtering out the unimportant parts of the text. Good document representation is important in text summarization for retrieving relevant information. Bag-of-words cannot capture word similarity in terms of syntactic and semantic relationships. Word embedding gives a good document representation that captures and encodes the semantic relations between words. Therefore, a centroid method based on word embedding representation is employed in this paper, and Myanmar news summarization based on different word embeddings is proposed. Myanmar local and international news are summarized with a centroid-based summarizer that exploits the effectiveness of the word embedding representation. Experiments were carried out on a Myanmar local and international news dataset using different word embedding models, and the results are compared with the performance of bag-of-words summarization. Centroid summarization using word embedding performs comprehensively better than centroid summarization using bag-of-words.


INTRODUCTION
Nowadays, information on the internet is constantly increasing, and it is necessary to compress different types of data. Summarization made by humans is very time-consuming and tedious. Therefore, automatic text summarization is essential to overcome this problem. Text summarization is a technique for extracting the essential information from an original text document into a shortened form.
Automatic text summarization systems can generally be classified into two main types: extractive and abstractive [1]. Extractive summarization forms the summary by extracting phrases or sentences from the document. The main goal of extractive summarization is to produce a summary without redundancy that gives the important points of the source document. Abstractive summarization uses new words to form a summary that describes the main content. Extractive summarization generally has three steps:
- building an intermediate representation model,
- scoring the sentences based on that representation, and
- selecting a number of sentences to compose the summary.
For the first step, there are two ways to represent the input text: topic representation approaches (centroid-based method, latent semantic analysis, discourse-based method, Bayesian topic models) [2][3][4] and indicator representation approaches (graph-based method, machine learning) [5][6]. Once the intermediate representation is generated, an importance score is assigned to each sentence in the second step. Finally, the summarizer system selects the most important sentences to produce the summary.
Previous Myanmar text summarization systems use a machine learning approach, conditional random fields (CRF), which treats information extraction as a sequence labeling task. A machine learning approach needs a large training dataset for good system performance. Kyaw [7] used CRF for word segmentation and information extraction: seven types of Myanmar natural disaster news (flood, landslide, earthquake, forest fire, storm, volcanic eruption, tornado) are summarized using a template-driven text summarization approach. That system does not consider other features, such as POS, that would improve the information extraction task, and it does not consider anaphora resolution. To improve its performance, more data would have to be collected to build a larger training corpus. Kyaw [8] described query-focused multi-document summarization of earthquake news in Myanmar. That system only produces a word-level summary of Myanmar news using a forward-backward algorithm, with a longest-matching approach to reduce redundant information. Our system aims to produce a sentence-level summary instead of a word-level summary.
Text summarization using semantic role labeling was proposed by Naing [9]. In that paper, anaphora resolution and a semantic role labeling algorithm for the Myanmar language were proposed. A Myanmar verb frame resource based on PropBank semantic resources was then built using the Myanmar-English Computational Lexicon and the Lexique Pro lexicon. Finally, both an entity and its reference in the sentence are chosen for summary generation. However, verbs in spoken sentences are not identified.
According to previous Myanmar text summarization research, only word-level summaries have been produced. The main goal of this paper is to produce a sentence-level summary. Although there are text summarizers for other languages such as Thai, Korean, and Chinese, Myanmar-language summarizers are still rare. Given these problems, a Myanmar text summarization system is clearly necessary for the area of Myanmar NLP research.
For representing text, bag-of-words models are commonly used. Although easy to implement, they have limitations, such as ignoring the semantics of the words in the document. Bag-of-words representation is mostly used for calculating sentence scores and ranking sentences to select the important ones. However, a bag-of-words vector representation cannot capture the semantic relationship between two texts if they have no words in common. To overcome this limitation of the bag-of-words representation model, word embedding is used in this paper. Word embedding provides better vector features for most natural language processing problems: it is a dense vector representation that can capture the syntactic and semantic information of a word. This paper describes a comparison of centroid summarizers that take advantage of different word embedding models against a baseline bag-of-words model [10].

MYANMAR LANGUAGE
The Myanmar language is a Sino-Tibetan language spoken in Myanmar, where it is the official language. The Burmese script is adapted from the Mon script, which is derived from Pali, the language of Theravada Buddhism. Myanmar is a tonal language. It is written horizontally from left to right and consists of 33 consonants and 14 vowels. The word order of the Myanmar language is subject-object-verb.
Myanmar is an under-resourced language, and there are no text summarization datasets as there are for other languages. Text summarization systems exist for Thai, Japanese, and English, but Myanmar text summarization systems are very few. Myanmar text summarization using conditional random fields (CRF) and a Myanmar verb-frame-based summarization system have been proposed. Nevertheless, no Myanmar summarization system has yet exploited the effectiveness of word embedding. Therefore, text summarization using word embedding is proposed in this paper.

WORD EMBEDDING
Word embedding is an NLP technique that can capture the meaning of a word in a document, its semantic and syntactic similarity, and its relations to other words. Embeddings are vector representations of particular words. The main purpose of Word2vec is to group the vectors of similar words together in vector space [11]. There are two architectures in Word2vec: continuous bag of words (CBOW) and skip-gram. CBOW predicts a target word based on its surrounding words; skip-gram predicts the surrounding words based on the target word [12].
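The difference between the two architectures can be seen in the training pairs they generate from a sentence. The sketch below builds those pairs for a toy tokenized sentence; the window size and tokens are illustrative and not from the paper.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each target word predicts each of its surrounding words."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))  # (target, one context word)
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: all surrounding words together predict the target word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((tuple(context), target))  # (context words, target)
    return pairs

sent = ["the", "students", "returned", "from", "wuhan"]
print(skipgram_pairs(sent, window=1))
print(cbow_pairs(sent, window=1))
```

In a real Word2vec model these pairs would then feed a shallow neural network that learns the dense vectors; only the pair-generation step is shown here.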

There are several word representation models, such as Word2vec and GloVe. However, Word2vec and GloVe fail to provide vector representations for words that are not in the training corpus. FastText works well with rare words, even words not seen during training. Therefore, fastText and BPEmb embeddings are used in this paper. FastText is a free library for learning word representations and sentence classification [13], with publicly available pretrained word vectors for many languages. BPEmb (byte-pair embeddings) is a collection of pretrained subword embeddings based on byte-pair encoding (BPE) and trained on Wikipedia [14]. It is used as input for neural models in natural language processing. Subwords allow guessing the meaning of unknown words, and byte-pair encoding gives a segmentation without requiring tokenization or morphological analysis. The vocabulary size is the sum of the number of BPE merge operations and the number of characters in the training data.
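The BPE merge process behind BPEmb can be sketched in a few lines: starting from character sequences, the most frequent adjacent symbol pair is repeatedly merged into a single symbol. The toy word list and merge count below are illustrative, not the paper's settings.

```python
from collections import Counter

def merge_word(syms, pair):
    """Replace every occurrence of the adjacent symbol pair with one merged symbol."""
    out, i = [], 0
    while i < len(syms):
        if i < len(syms) - 1 and (syms[i], syms[i + 1]) == pair:
            out.append(syms[i] + syms[i + 1])
            i += 2
        else:
            out.append(syms[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn BPE merge operations from a word list (characters as initial symbols)."""
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for syms, freq in vocab.items():
            new_vocab[tuple(merge_word(syms, best))] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["lower", "lowest", "low", "low"], num_merges=3))
# → [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

The final subword vocabulary size is then the number of merges plus the number of distinct characters, matching the definition above.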

DESIGN OF PROPOSED SYSTEM
Sentences and documents are represented in some feature vector space. The centroid can be defined as the vector of the whole document [15]. Summary sentences are chosen by selecting the sentences whose vectors are most similar to the centroid vector, which can be done with different representations. The original centroid approach uses bag-of-words as the representation model for sentence scoring and selection, but the bag-of-words model cannot grasp the semantic relationship between sentences. Therefore, Rossiello [16] proposed centroid summarization utilizing word embedding, representing sentences and words with embeddings: the centroid embedding is calculated as the sum of the word embeddings of the most important words, and sentence embeddings are calculated as the sums of the word embeddings they contain. Figure 1 depicts the design of the proposed system, which has four main steps:
- preprocessing,
- computing the centroid embedding,
- computing the sentence embeddings, and
- producing the summary sentences.
Firstly, input news documents are segmented into words and stopwords are removed in the preprocessing step. Then, the centroid embedding and sentence embeddings are computed using a lookup table. Cosine similarity scores between each sentence and the centroid embedding are computed, and the sentences are ranked in descending order of similarity score. Finally, the summary is produced. This paper proposes utilizing different types of word embedding for representing sentences. The quality of the sentence representation can affect the performance of centroid-based summarization; therefore, centroids based on different types of word embedding are proposed in this paper.

Preprocessing
In English, word boundaries can be easily determined because there are spaces between words. In the Myanmar language, words are written without spaces; spaces are sometimes used between phrases, but there are no defined rules for their use. Therefore, word segmentation is a challenging task in Myanmar natural language processing. In this paper, Myanmar news is segmented by the Myanmar word segmenter of [17]. That system uses a combined model of bigrams and word junctures: it applies longest matching and a bigram method with a pre-segmented corpus of 50,000 words collected manually from Myanmar textbooks, newspapers, and journals. The corpus is in Unicode encoding. In text summarization, stopwords are treated as irrelevant information; therefore, Myanmar stopwords (prepositions and conjunctions) are removed from the news documents.
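The longest-matching step of such a segmenter can be sketched as a greedy left-to-right scan that always takes the longest dictionary word starting at the current position. The toy Latin-script lexicon stands in for a real pre-segmented Myanmar corpus, and the bigram re-ranking used in the cited segmenter [17] is omitted.

```python
def longest_match_segment(text, lexicon):
    """Greedy longest-match segmentation against a word lexicon."""
    words, i = [], 0
    max_len = max(map(len, lexicon))
    while i < len(text):
        # try the longest candidate first, shrinking down to one character
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in lexicon or length == 1:
                words.append(candidate)  # unknown single characters pass through
                i += length
                break
    return words

lexicon = {"news", "paper", "newspaper", "s"}
print(longest_match_segment("newspapers", lexicon))  # → ['newspaper', 's']
```

Greedy longest matching alone can mis-segment ambiguous strings, which is why the cited system combines it with bigram statistics.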

Centroid embedding
Words whose term frequency-inverse document frequency (TF-IDF) weight is greater than a topic threshold t are selected to compute the centroid embedding. The vectors of these top-ranked words in the document are summed using the embedding lookup table, as in (1):

C = Σ_{w ∈ D, tfidf(w) > t} lookup[idx(w)]    (1)

where C is the centroid embedding of the document D and idx(w) is a function that returns the index of word w in the vocabulary.
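A minimal sketch of (1), assuming precomputed TF-IDF weights and a toy two-dimensional embedding lookup table (both illustrative values, not the paper's):

```python
def centroid_embedding(tfidf, lookup, topic_threshold):
    """Sum the embeddings of words whose TF-IDF weight exceeds the topic threshold."""
    dim = len(next(iter(lookup.values())))
    centroid = [0.0] * dim
    for word, weight in tfidf.items():
        if weight > topic_threshold and word in lookup:
            centroid = [c + v for c, v in zip(centroid, lookup[word])]
    return centroid

lookup = {"health": [1.0, 0.0], "students": [0.0, 1.0], "the": [0.5, 0.5]}
tfidf = {"health": 0.9, "students": 0.7, "the": 0.1}
print(centroid_embedding(tfidf, lookup, topic_threshold=0.4))  # → [1.0, 1.0]
```

Only "health" and "students" pass the threshold here, so the centroid reflects the topical words and ignores the low-weight function word.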

Sentence embedding
The embedding of each sentence is obtained by summing the vectors of its words from the lookup table, as in (2):

S_j = Σ_{w ∈ S_j} lookup[idx(w)]    (2)

where S_j is the j-th sentence in document D. The sentence score is then the cosine similarity between the embedding of sentence S_j and the centroid C of document D, as in (3):

sim(C, S_j) = (C · S_j) / (||C|| ||S_j||)    (3)

Cosine similarity measures how close the two vectors C and S_j are based on the angle between them.
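Equations (2) and (3) can be sketched directly, again with toy vectors rather than real embeddings:

```python
import math

def sentence_embedding(sentence, lookup):
    """(2): sum the embedding vectors of the words in the sentence."""
    dim = len(next(iter(lookup.values())))
    emb = [0.0] * dim
    for word in sentence:
        if word in lookup:
            emb = [e + v for e, v in zip(emb, lookup[word])]
    return emb

def cosine(u, v):
    """(3): dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

lookup = {"health": [1.0, 0.0], "students": [0.0, 1.0]}
centroid = [1.0, 1.0]
score = cosine(centroid, sentence_embedding(["health", "students"], lookup))
print(round(score, 3))  # → 1.0
```

A sentence containing both centroid words points in the same direction as the centroid, so its cosine score is maximal.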

Sentence selection
The sentences are sorted in descending order according to their similarity scores. The top-scoring sentences are selected for the summary until a predefined limit (number of words) is reached. In this paper, the word limit for the generated summary is 150 words.
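The selection step can be sketched as a greedy loop over the ranked sentences under the word budget; sentences and scores are illustrative, and whether to stop at or skip a sentence that overflows the budget is one possible policy.

```python
def select_summary(scored_sentences, word_limit=150):
    """Pick sentences in descending score order until the word budget is reached."""
    summary, used = [], 0
    for sentence, score in sorted(scored_sentences, key=lambda p: p[1], reverse=True):
        n = len(sentence.split())
        if used + n > word_limit:
            break  # stop once the next sentence would exceed the budget
        summary.append(sentence)
        used += n
    return summary

scored = [("short sentence one", 0.7),
          ("another longer candidate sentence here", 0.9),
          ("low scoring filler", 0.2)]
print(select_summary(scored, word_limit=8))
```

With a budget of 8 words, the two highest-scoring sentences (5 + 3 words) fit exactly and the lowest-scoring one is dropped.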

EXPERIMENTAL RESULTS AND ANALYSIS

Data set
Unlike other languages, Myanmar has no datasets for text summarization. Therefore, daily news from Myanmar news websites [18][19][20][21] was manually collected to build a Myanmar news dataset. The data is in Unicode encoding. In Myanmar, the Unicode font is not widely adopted and Myanmar people mostly use the Zawgyi font, but Unicode is used as the standard encoding in Myanmar natural language processing tasks. This manually collected dataset currently consists of 2,000 news articles, with an average length of 5 sentences per article. There is no gold (reference) summary for the dataset as there is for English summarization datasets; therefore, gold summaries were generated by 10 human evaluators. The centroid summarizer is experimentally evaluated on the Myanmar local and international news datasets. Centroids based on two different word embedding models (BPEmb and fastText) are used for sentence representation. BPEmb pretrained embeddings for the Burmese (Myanmar) language have been published recently [22]. In this paper, BPEmb pretrained embeddings for the Myanmar language with a vocabulary size of 200,000 and dimension 300 are used [23, 24]. In this system, these word embedding models are applied for sentence representation. Table 1 shows the most relevant sentences of a sample Myanmar news article produced by the centroid approach combined with word embedding. The first column shows the sentence ID, the second column shows the sentences, and the third column shows the cosine similarity scores. Centroid words are selected by TF-IDF value greater than the topic threshold. In the following article, the centroid words are ကျန်းမာရေး (health), ကျောင်းသား (students), စောင့်ကြပ် (monitor), ဆေးရုံ (hospital), and ရောဂါ (disease). The most important and relevant sentence is sentence ID 3, which contains many words that are close to the centroid vector. The centroid words in the sentences are marked in bold. The summary is provided until the limit (number of words) is reached.
"A total of 59 people were evacuated from Wuhan: 16 male students, 42 female students, and a three-year-old girl, and they are all well on the third day of quarantine, with no fever and no respiratory symptoms," and "the condition of the three health workers who took part in the mission to bring the students back is also being monitored, and they are also in good health; quarantine does not mean that they have been infected with the coronavirus, and it is being imposed to prevent the possibility of an infection," he added.

Evaluation and analysis
One of the metrics for evaluating text summarization is recall-oriented understudy for gisting evaluation (ROUGE) [25]. It compares a system summary against one or several human reference summaries based on n-grams. Precision in ROUGE measures how much of the system summary is relevant, and is computed as (4):

Precision = (number of overlapping n-grams) / (total n-grams in the system summary)    (4)

Recall in ROUGE measures how much of the reference summary the system summary recovers or captures, and is computed as (5):

Recall = (number of overlapping n-grams) / (total n-grams in the reference summary)    (5)
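Equations (4) and (5) can be sketched for ROUGE-n with a toy system/reference pair (the example sentences are illustrative):

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system, reference, n=2):
    """(4) precision and (5) recall over overlapping n-grams."""
    sys_ng, ref_ng = ngrams(system, n), ngrams(reference, n)
    overlap = sum((sys_ng & ref_ng).values())  # clipped n-gram matches
    precision = overlap / max(sum(sys_ng.values()), 1)
    recall = overlap / max(sum(ref_ng.values()), 1)
    return precision, recall

system = "the students are in good health".split()
reference = "all students are in good health today".split()
p, r = rouge_n(system, reference, n=2)
print(round(p, 3), round(r, 3))  # → 0.8 0.667
```

Four of the system's five bigrams also appear in the reference (precision 4/5), while the reference has six bigrams in total (recall 4/6), matching (4) and (5).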
To find the best parameter values, different parameter configurations are tested, both for the original centroid using the bag-of-words representation and for the centroid taking advantage of the different word embedding models. The parameter settings are the topic threshold (topic-t) in [0, 0.5] and the similarity threshold (sim-t) in [0.5, 1]. Two pretrained word vector sets (fastText trained on Wikipedia and BPEmb trained on Wikipedia) are used for the centroid and sentence embeddings. Figures 2-4 show the evaluation results of the original centroid using the bag-of-words representation and the centroid with the different word embedding models. In Figure 2, the topic threshold is set to 0.4 and the similarity threshold to 0.98, and the ROUGE-2 scores of the centroids based on the word embedding models are better than the baseline original centroid based on the bag-of-words summarizer. In Figure 3, the topic threshold is set to 0.3 and the similarity threshold to 0.96, and the results show that the ROUGE-2 score of the centroid summarizer based on fastText pretrained word vectors trained on Wikipedia is better than all other representation schemes; the centroid summarizer based on BPEmb pretrained word vectors trained on Wikipedia is slightly below the baseline bag-of-words model. As shown in Figure 4, with the topic threshold set to 0.2 and the similarity threshold to 0.95, the centroid summarizer based on the BPEmb embedding model achieves the best ROUGE scores, while the bag-of-words model gets higher ROUGE scores than fastText. A possible reason is that word embedding needs a higher threshold because it requires an accurate choice of meaningful words to compose the centroid vector.

CONCLUSION
In this paper, the relevant sentences are extracted by choosing the sentences whose vectors are most similar to the centroid vector. Word embedding is utilized to capture the contexts of words with dense representations in the form of numeric vectors: word vectors from the trained embeddings are summed to form sentence and document embeddings. The experimental results on Myanmar news data show that the embedding-based centroid summarizer improves on the performance of the bag-of-words model, since word embedding is better than bag-of-words (BOW) at capturing which words are similar in their syntactic and semantic relationships. This centroid method will be applied to more complex summarization tasks, such as multi-document summarization and query-focused summarization, in future work.