Improving accuracy of part-of-speech (POS) tagging using hidden Markov model and morphological analysis for Myanmar language

In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. POS information is also needed as a preprocessing step for NLP applications such as machine translation (MT) and information retrieval (IR). Much current research develops word segmentation and POS tagging separately, with different methods, to achieve high performance and accuracy. For the Myanmar language, there are likewise separate word segmenters and POS taggers based on statistical approaches such as Neural Networks (NN) and Hidden Markov Models (HMMs). However, because of the complex morphological structure of the Myanmar language, the out-of-vocabulary (OOV) problem still exists. To avoid errors and to improve segmentation by exploiting POS information, segmentation and tagging should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and to remove the ambiguity in sentences that arises from the structure of the language. This paper focuses on developing a word segmenter and POS tagger for the Myanmar language, and compares separate word segmentation and POS tagging with joint word segmentation and POS tagging.


INTRODUCTION
In many applications of natural language processing, Part-of-Speech (POS) tagging is an essential task for every language. A high-precision tagger is therefore an important requirement for NLP applications. Handling ambiguous and unknown words is the main challenge of POS tagging [1,2].
NLP applications such as machine translation, information extraction, speech recognition, grammar checking and word sense disambiguation all require word segmentation and POS tagging as a fundamental processing step. There are many methods for developing POS taggers; the most widely used are rule-based, statistical and neural network based methods. In the rule-based approach, rules are written according to the nature of the language to define precisely how and where the various POS tags are assigned [3][4][5]. This approach has already been used to build a POS tagger for the Myanmar language. In the statistical approach, statistical language models are built, refined and used to tag the input text automatically. The most commonly used statistical approaches are based on Hidden Markov Models, Support Vector Machines, Conditional Random Fields and Maximum Entropy [6,7]. This paper describes Hidden Markov Models (HMMs) and the proposed system for word segmentation and POS tagging for the Myanmar language.
The Myanmar language is morphologically rich, complex and agglutinative in nature, and its expressions carry numerous linguistic features. POS tagging [8], i.e., the ability of a computer to automatically assign POS tags to a given sentence, is a significant problem in NLP and one of the fundamental processing steps for any language. Normally, the first step of processing is to divide the input text into units called tokens, each of which is either a word or something else such as a number. In space-delimited languages like English, the main clue is white space. In major Asian languages such as Japanese, Chinese, Thai and Myanmar, there are no spaces between words; the Myanmar writing style does not use any delimiter between words.
In word segmentation and POS tagging, the morphological structure of words is the main source of information for tagging correctly. Using the morphological structure of a word, irrelevant tags can be eliminated and the suitable tag for the word can be found [9][10][11]. Morphological analysis is therefore an important part of language engineering applications, especially for a morphologically rich and complex language like Myanmar.
Very little research has been conducted on language processing tasks, including morphological analysis, for the Myanmar language compared with English, French, Chinese, Indian languages, Thai, etc. High-level language processing tasks such as POS tagging, machine translation, semantic analysis, syntactic analysis, sentiment analysis, information retrieval, classification and clustering all operate on the smallest language unit, the word. A systematic linguistic study of the morphology of the language is therefore important in order to reveal words that are significant to users such as historians and linguists.
Most current research on the Myanmar language uses a lexicon, dictionary or corpus that lists all word forms for word segmentation as an initial stage of processing. Correct segmentation therefore requires an exhaustive lexicon or corpus. Linguists have classified the Myanmar language [12][13][14][15][16] as a monosyllabic or isolating language with agglutinative features. Its writing style does not use any delimiter between words, so there is no way of knowing whether a group of syllables forms a word or is just a sequence of separate monosyllabic words. Every syllable has a meaning of its own. The Myanmar language has complex morphotactic structures and ambiguous word segmentation. Segmenting a sentence into lexically and semantically meaningful word sequences is therefore a challenging task. This paper aims to address this shortcoming by proposing a language model that performs joint word segmentation and POS tagging.
The rest of this paper is organized as follows. Section 2 reviews the literature. Section 3 describes aspects of the Myanmar language. Section 4 presents the design of the proposed system. Section 5 provides the evaluation. Finally, we conclude the paper.

LITERATURE REVIEW
A Part-of-Speech tagger for the Myanmar language that uses a supervised learning approach is presented in [17]. To disambiguate POS tags, an HMM is used with the Baum-Welch algorithm for training and the Viterbi algorithm for decoding. To tag a word, a Myanmar lexicon listing all of its possible tags is used. The experimental results show that the method achieved high precision (over 90%) on various test inputs.
Myanmar Word Segmentation [18] used a hybrid approach in which sentences are segmented into syllables and then matched against the longest words. In the longest matching method, words known from a dictionary are segmented first and unknown words are guessed from an n-gram model [19]. The major issue with this technique comes from ambiguity in the longest matching procedure, since words can appear in many forms.
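The longest matching step can be illustrated with a minimal sketch. It is not the implementation of [18]: the function name, the `max_len` limit and the romanized placeholder syllables are assumptions made for illustration, and unknown material is simply emitted one syllable at a time instead of being guessed from an n-gram model.

```python
def longest_match(syllables, dictionary, max_len=5):
    """Greedy left-to-right longest matching over a list of syllables."""
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink the window.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback (simplification).
            if n == 1 or candidate in dictionary:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical romanized syllables and dictionary entries, for illustration only.
toy_dictionary = {"mipu", "panchi"}
print(longest_match(["mi", "pu", "pan", "chi", "ye"], toy_dictionary))
```

Because the longest dictionary entry always wins, a sequence that could also be read as several shorter words is never reconsidered, which is the ambiguity problem noted above.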
The approach proposed by Y. Zhang and S. Clark [20] achieved a lower error rate than a two-stage baseline system. A single linear model is used for simultaneous word segmentation and POS tagging, with the generalized perceptron algorithm for joint training and beam search for decoding. The large combined search space is a challenge for this method and makes decoding very hard. Compared with the traditional pipeline approach, the joint model gives an error reduction of 14.6% in segmentation accuracy and 12.2% in tagging accuracy.
A Persian POS tagger [11] tags Persian sentences by combining statistical and rule-based methods. To tag unknown words, a probabilistic morphological analysis is used. A second outcome of that research is a knowledge base of Persian morphological rules whose probabilities are estimated from a corpus. The experimental results show that the approach increases tagging performance and accuracy.

ASPECT OF MYANMAR LANGUAGE
The Myanmar language is highly agglutinative and morphologically rich and complex. Moreover, the Myanmar writing style does not use spaces to separate words, so there is no way of knowing whether a group of syllables forms a word or is only a sequence of separate monosyllabic words. Every syllable has its own meaning. Myanmar words consist of one or more syllables, which are combined in different ways. Depending on how words are formed from syllables, they can be classified into three types: simple words, complex words and reduplicative words [21,22]. For example, ပေါင်း (steam) + အိုး (pot) => ပေါင်းအိုး (rice cooker), မီး (fire) + ပူ (hot) => မီးပူ (iron), and ပန်း (flower) + ချီ (carry) => ပန်းချီ (painting) all have their own referential meaning, and each monosyllable within these words also has its own meaning. Morphological processes in Myanmar include inflection, derivation and compounding.

Derivation Morphology
In Myanmar morphology, derivation occurs by means of prefixation or suffixation, and it can change the POS tag of a word form. Nouns, verbs and adjectives are also derived by suffixation, but circumfixes occur in the Myanmar language as well. For example: အလုပ် (work) -> အ (prefix) + လုပ် (do); ပြေးခြင်း (running) -> ပြေး (run) + ခြင်း (suffix). However, အ- is not a bound prefix morpheme in some nouns and verbs, and such words cannot be split; for example, if the word အမေ (mother) is split, the parts have no meaning.

Compounding
The Myanmar language has many compound words. These include noun compounds, verb compounds and adjective compounds, as well as compounds that combine nouns, verbs and adjectives.

DESIGN OF PROPOSED SYSTEM
The structure of the proposed system is shown in Figure 1. There are two modules: training and testing. In the training phase, a collection of segmented and tagged sentences is used to build the proposed HMM model, which is then used in the testing phase. In the testing phase, the input Myanmar text is split into sentences using the sentence end marker called pote-ma '။'. After that, word segmentation and POS tagging are performed.
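As an illustration of the sentence identification step, the following minimal sketch splits text at the pote-ma, which is the Unicode character U+104B (MYANMAR SIGN SECTION). The function name and the choice to drop the marker and trim whitespace are assumptions, not details taken from the proposed system.

```python
POTE_MA = "\u104B"  # '။', MYANMAR SIGN SECTION, the sentence end marker


def split_sentences(text):
    """Split raw text into sentences at each pote-ma, dropping empty pieces."""
    return [piece.strip() for piece in text.split(POTE_MA) if piece.strip()]


# Placeholder Latin text stands in for Myanmar sentences here.
print(split_sentences("first sentence\u104B second sentence\u104B"))
```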

Corpus Creation
POS-tagged corpora are one of the essential resources for developing a state-of-the-art POS tagger for Myanmar. There are several steps to create a tagged corpus. The following list shows the steps needed for corpus building.
• Collecting raw text
• Hand-annotating and preparing training data
We collect and normalize raw text from online journals, newspapers and e-books. Since the documents use various Myanmar font styles, they are converted to standard Unicode format and cleaned, for example by spell checking. We then assign tags to the un-annotated text manually to obtain the training data for the statistical method. If the number of tags is large, complexity increases and performance decreases. According to Myanmar grammar books and dictionaries [12][13][14][15][16], there are nine Part-of-Speech tags in the Myanmar language. We annotated every word with the appropriate basic POS tag and created a POS-tagged corpus. In addition, we added three more POS tags in our research: Number, Symbol and Abbreviation. The tagset is described in Table 1.

Training Hidden Markov Model
To train the model, we compute probabilities for each tag in the tagged corpus. The training phase produces two results: transition probabilities and emission probabilities.
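For orientation, the way these two sets of probabilities are usually combined in a bigram HMM tagger can be written as follows. This is the standard textbook formulation and is stated here as an assumption; the paper does not spell out its decoding objective. Here $w_1,\dots,w_n$ are the words, $t_1,\dots,t_n$ their tags, and $t_0$ a sentence-start symbol.

```latex
\hat{t}_{1:n} \;=\; \arg\max_{t_{1:n}} \; \prod_{i=1}^{n}
  \underbrace{P(w_i \mid t_i)}_{\text{emission}} \;
  \underbrace{P(t_i \mid t_{i-1})}_{\text{transition}}
```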

Estimating Probabilities
In POS tagging with an HMM, the probabilities are estimated by counting over a tagged training corpus rather than by using the full power of HMM learning. The tag transition probability $P(t_i \mid t_{i-1})$ is the probability of a tag given the previous tag. It is estimated by counting how often the first tag occurs in the tagged corpus and how often the first tag is followed by the second: $P(t_i \mid t_{i-1}) = C(t_{i-1}, t_i) / C(t_{i-1})$, where $C(\cdot)$ denotes a count in the tagged corpus.
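The emission probability $P(w_i \mid t_i)$ is the probability that a given tag is associated with a given word [23]. The emission probability is estimated by relative frequency as $P(w_i \mid t_i) = C(t_i, w_i) / C(t_i)$, where $C(\cdot)$ denotes a count in the tagged corpus.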
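The counting procedure can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the `<s>` start symbol, the corpus format (a list of sentences, each a list of `(word, tag)` pairs) and the tag labels `N` and `V` are hypothetical, and no smoothing is applied.

```python
from collections import Counter


def estimate_hmm(tagged_sentences, start="<s>"):
    """Relative-frequency estimates of transition and emission probabilities."""
    tag_count = Counter()      # C(t)
    bigram_count = Counter()   # C(t_prev, t)
    emit_count = Counter()     # C(t, w)
    for sentence in tagged_sentences:
        prev = start
        tag_count[start] += 1  # one start symbol per sentence
        for word, tag in sentence:
            bigram_count[(prev, tag)] += 1
            emit_count[(tag, word)] += 1
            tag_count[tag] += 1
            prev = tag
    # P(t | t_prev) = C(t_prev, t) / C(t_prev)
    transition = {(p, t): c / tag_count[p] for (p, t), c in bigram_count.items()}
    # P(w | t) = C(t, w) / C(t)
    emission = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return transition, emission


# Toy corpus with placeholder words and tags.
corpus = [
    [("w1", "N"), ("w2", "V")],
    [("w1", "N"), ("w3", "N"), ("w2", "V")],
]
transition, emission = estimate_hmm(corpus)
print(transition[("N", "V")], emission[("N", "w1")])
```

In the toy corpus the tag `N` occurs three times and is followed by `V` twice, so the estimated transition probability from `N` to `V` is 2/3.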

Joint Myanmar Word Segmentation and POS Tagging
The input text is first split into sentences at the pote-ma "။". The words in each sentence are then segmented and assigned POS tags from the proposed tagset in Table 1 using the HMM probabilistic model. In the Myanmar language, a word consists of one or more syllables and a syllable consists of one or more characters, so syllable identification must be done before word-level segmentation [24]. For example, the input shown in Table 3 is: က ြာပန ်းသည ရေထဲ တွ င ရပေါ သည ။ (Lotus grows in water.) After syllable identification, the correct output is: က ြာ|ပန ်း|သည |ရေ|ထဲ |တွ င |ရပေါ |သည
A common strategy for performing word segmentation and POS tagging simultaneously is to use N-grams (here 5-grams): the system scans an input sentence from left to right and retrieves each candidate word, with all of its possible tags and their probabilities, from the emission probability file.
If a 5-gram is not contained in the emission probability file, the system backs off to 4-grams, trigrams, bigrams and finally unigrams.

After word segmentation, the output is: က ြာပန ်း ၊သည ၊ရေ၊ထဲ တွ င ၊ရပေါ ၊သည
Word probabilities and language model probabilities are calculated using relative frequency counts. If a word has more than one POS option, the system selects the option with the highest word probability, as described in Table 4.
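The left-to-right scan with backoff from 5-grams down to unigrams, followed by selection of the most probable tag, might look like the following sketch. It is a simplified, greedy illustration and not the authors' implementation: the emission table layout (word mapped to a tag-probability dictionary), the `UNK` label for unmatched syllables, and the placeholder syllables and tags are all assumptions, and transition probabilities are ignored here.

```python
def segment_and_tag(syllables, emission, max_n=5):
    """Greedy joint segmentation and tagging over a list of syllables.

    emission maps a word (joined syllables) to {tag: probability}.
    """
    result = []
    i = 0
    while i < len(syllables):
        # Try a 5-syllable candidate first, then back off to shorter ones.
        for n in range(min(max_n, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if candidate in emission:
                tags = emission[candidate]
                best_tag = max(tags, key=tags.get)  # highest word probability
                result.append((candidate, best_tag))
                i += n
                break
        else:
            # No n-gram starting here is known: emit one syllable as unknown.
            result.append((syllables[i], "UNK"))
            i += 1
    return result


# Placeholder syllables and tags, for illustration only.
toy_emission = {"ab": {"N": 0.6, "V": 0.4}, "c": {"PART": 1.0}}
print(segment_and_tag(["a", "b", "c", "d"], toy_emission))
```

The unmatched syllables that this sketch labels `UNK` correspond to the unknown words that the morphological rules in the next subsection are meant to handle.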

Morphological Rules Approach
The internal structure of words is defined by morphological rules [11]. These rules consist of three parts: prefix (အ), stem and suffix (မ ်း).
The general syntax is as follows: prefix + stem + suffix → POS tag. In this syntax, a string sometimes contains both a prefix and a suffix; in other cases, either the prefix or the suffix is the empty string. There are three types of morphological rules for the Myanmar language: inflectional, derivational and compounding rules. In this system, 68 morphological rules are defined [25] and used. The rules are drawn from Myanmar grammar books [12][13][14][15][16]. The uses of inflection, derivation and compounding are described in Section 3.
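A rule of the form prefix + stem + suffix → POS tag can be represented and applied as in the sketch below. The `Rule` fields, the stem lexicon, the romanized placeholder forms and the two example rules are hypothetical; the 68 rules of the actual system are not reproduced here.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    prefix: str      # may be the empty string
    suffix: str      # may be the empty string
    stem_tag: str    # POS tag the stem must have
    result_tag: str  # POS tag assigned to the whole word


def apply_rules(word, stem_lexicon, rules):
    """Return the POS tag of the first matching rule, or None if none applies."""
    for rule in rules:
        if not (word.startswith(rule.prefix) and word.endswith(rule.suffix)):
            continue
        stem = word[len(rule.prefix):len(word) - len(rule.suffix)]
        # The stem must be non-empty and known with the required tag.
        if stem and stem_lexicon.get(stem) == rule.stem_tag:
            return rule.result_tag
    return None


# Romanized placeholders: a prefix rule and a suffix rule deriving nouns from verbs.
rules = [Rule("a", "", "V", "N"), Rule("", "chin", "V", "N")]
stems = {"lote": "V", "pyay": "V"}
print(apply_rules("alote", stems, rules))     # prefix + verb stem
print(apply_rules("pyaychin", stems, rules))  # verb stem + suffix
print(apply_rules("amay", stems, rules))      # no known stem: word is not split
```

Requiring the stem to be present in the lexicon keeps the sketch from splitting words whose initial syllable only looks like a prefix.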

EVALUATION
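To evaluate the testing results for POS tagging, the system uses Recall, Precision and F-score. These measures are defined as follows:
$$\text{Recall} = \frac{\text{number of correctly tagged words}}{\text{number of words in the reference data}}, \qquad \text{Precision} = \frac{\text{number of correctly tagged words}}{\text{number of words tagged by the system}}, \qquad \text{F-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$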
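One common way to compute these measures for joint segmentation and tagging is to compare words by their character span and tag, so that a word counts as correct only if both its boundaries and its tag match the reference. The sketch below follows that convention as an assumption; the function names and placeholder data are hypothetical.

```python
def to_spans(tagged_words):
    """Convert [(word, tag), ...] into a set of (start, end, tag) spans."""
    spans = set()
    position = 0
    for word, tag in tagged_words:
        spans.add((position, position + len(word), tag))
        position += len(word)
    return spans


def precision_recall_f(predicted, reference):
    """Precision, recall and F-score of predicted words against the reference."""
    pred_spans = to_spans(predicted)
    ref_spans = to_spans(reference)
    correct = len(pred_spans & ref_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Placeholder example: the system merges the last two reference words.
reference = [("ab", "N"), ("c", "V"), ("d", "P")]
predicted = [("ab", "N"), ("cd", "V")]
print(precision_recall_f(predicted, reference))
```

In this example one of two predicted words is correct and one of three reference words is recovered, giving a precision of 1/2 and a recall of 1/3.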

Experimental Setup
For testing the proposed model, we divided our corpus into two corpora, as shown in Table 5. We collected 500 new sentences for open testing. In our experiments, we compare separate word segmentation and POS tagging using HMM, joint word segmentation and POS tagging using HMM, and joint word segmentation and POS tagging using HMM with morphological rules (Table 6). For comparison, we used the Bigram Part-of-Speech Tagger for Myanmar Language [17] as the baseline system. The proposed system and the baseline system used the same training corpus and test data.
Table 6 shows the experimental results for Myanmar word segmentation and POS tagging with different training data sizes. According to the table, the proposed method begins to show some improvement over the baseline. The improvement over the corresponding baselines grows when the number of training sentences is increased and when morphological rules are used.
The accuracy of the tagger is evaluated on test data containing different kinds of words. Test words can be classified as known words, unknown words and ambiguous words for the tagger. "Known words" are words contained in the training corpus, and "unknown words" are words not contained in the training corpus. "Ambiguous words" are known words that are tagged incorrectly because of segmentation errors and that must be disambiguated to determine the correct tag. In the proposed system, most unknown words are Proper Nouns (names of persons and locations), since no training data can cover all Proper Nouns. In addition, the different positions of Particles and Postpositional markers in a segmentation can cause ambiguity in POS tagging. Ambiguous and unknown words decrease the performance of the tagger. Morphological rules are used to disambiguate the ambiguous words.
By using morphological rules, the system reduced the ambiguity in Particles and Postpositional markers.

CONCLUSION
This paper presents joint word segmentation and POS tagging for Myanmar using HMM and morphological rules. In our experiments, we compared separate word segmentation and POS tagging with our proposed joint word segmentation and POS tagging using HMM, and found a significant improvement for joint word segmentation and POS tagging using HMM with morphological rules. We also described the distribution of words in the corpus. Unknown words still remain in our experiments. Future work will aim to improve the accuracy of word segmentation and POS tagging, for which a larger training corpus is also needed. With a large training corpus and morphological rules, POS tag assignment will become more accurate, and the number of unknown words, incorrect tags and ambiguous words will be reduced. The paper has shown that word segmentation and POS tagging for Myanmar can be improved by using a larger training corpus combined with morphological analysis of the Myanmar language.