Text Preprocessing using Annotated Suffix Tree with Matching Keyphrase

Ionia Veritawati, Ito Wasito, T Basaruddin

Abstract


Text documents are an important source of information and knowledge. Most of the knowledge needed in various domains for different purposes is in the form of implicit content. The content of a text is represented by keyphrases, which consist of one or more meaningful words. Keyphrases can be extracted from text through several processing steps, including text preprocessing. An Annotated Suffix Tree (AST) built from the document collection itself is used to extract the keyphrases, after basic text preprocessing that includes stop-word removal and stemming has been applied. A combination of four preprocessing variations is used. The extracted two-word (bi-word) and three-word phrases form a list of candidate keyphrases, which can help users who need keyphrase information to understand the content of the documents. The candidate keyphrases can be processed further by a learning process, with manual validation, to classify each one as a keyphrase or non-keyphrase for the text domain. Experiments on a simulated corpus whose keyphrases are known in advance show that more than 90% of two- and three-word keyphrases can be extracted; on a real corpus of economics texts, about 70% of keyphrases or meaningful phrases can be extracted. The proposed method can be an effective way to find candidate keyphrases in a collection of text documents: it reduces the number of non-keyphrases or meaningless phrases in the candidate list and detects keyphrases whose words are separated by stop words.
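The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a word-level suffix tree limited to depth three, a tiny hypothetical stop-word list, a simple frequency threshold (`min_count`) for selecting candidates, and it omits stemming. All function names (`preprocess`, `build_ast`, `candidate_phrases`) are hypothetical.

```python
# Illustrative stop-word list (an assumption; a real list would be much longer).
STOP_WORDS = {"the", "of", "a", "is", "in", "and", "to"}


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = [t.strip(".,;:!?()\"'").lower() for t in text.split()]
    return [t for t in tokens if t and t not in stop_words]


def build_ast(documents, max_depth=3):
    """Build a word-level annotated suffix tree as nested dicts.

    Each node is annotated with 'count': how many times the word
    sequence from the root to that node occurs in the collection.
    """
    root = {"count": 0, "children": {}}
    for tokens in documents:
        for i in range(len(tokens)):
            node = root
            # Insert the suffix starting at position i, truncated to max_depth words.
            for word in tokens[i:i + max_depth]:
                child = node["children"].setdefault(
                    word, {"count": 0, "children": {}}
                )
                child["count"] += 1
                node = child
    return root


def candidate_phrases(root, length, min_count=2):
    """Return (phrase, count) pairs for paths of exactly `length` words
    whose annotated count is at least `min_count`."""
    results = []

    def walk(node, path):
        if len(path) == length:
            if node["count"] >= min_count:
                results.append((" ".join(path), node["count"]))
            return
        for word, child in node["children"].items():
            walk(child, path + [word])

    walk(root, [])
    # Most frequent first, ties broken alphabetically.
    return sorted(results, key=lambda item: (-item[1], item[0]))


raw_docs = [
    "The rate of inflation is rising.",
    "Inflation rate and the interest rate.",
    "Rate of inflation in the economy.",
]
docs = [preprocess(d) for d in raw_docs]
tree = build_ast(docs)
bi_words = candidate_phrases(tree, length=2)
```

In this toy collection, "rate of inflation" occurs twice with a stop word between its content words; after stop-word removal both occurrences collapse to the bi-word "rate inflation", which is the only phrase that reaches the threshold. This is the kind of stop-word-separated keyphrase the abstract refers to.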

Keywords


Text Mining


DOI: http://doi.org/10.11591/ijece.v5i3.pp409-420

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

International Journal of Electrical and Computer Engineering (IJECE)
p-ISSN 2088-8708, e-ISSN 2722-2578