Automatic summarization of Malayalam documents using clause identification method

Sunitha C, A Jaya, Amal Ganesh


Text summarization is an active research area in the field of natural language processing. Huge amount of information in the internet necessitates the development of automatic summarization systems. There are two types of summarization techniques: Extractive and Abstractive. Extractive summarization selects important sentences from the text and produces summary as it is present in the original document. Abstractive summarization systems will provide a summary of the input text as is generated by human beings. Abstractive summary requires semantic analysis of text. Limited works have been carried out in the area of abstractive summarization in Indian languages especially in Malayalam. Only extractive summarization methods are proposed in Malayalam. In this paper, an abstractive summarization system for Malayalam documents using clause identification method is proposed. As part of this research work, a POS tagger and a morphological analyzer for Malayalam words in cricket domain are also developed. The clauses from input sentences are identified using a modified clause identification algorithm. The clauses are then semantically analyzed using an algorithm to identify semantic triples - subject, object and predicate. The score of each clause is then calculated by using feature extraction and the important clauses which are to be included in the summary are selected based on this score. Finally an algorithm is used to generate the sentences from the semantic triples of the selected clauses which is the abstractive summary of input documents.


Abstractive summarization, Clause Boundary, MBT tagger, Morphological Analysis, Sentence score

Full Text:



Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578