Source side pre-ordering using recurrent neural networks for English-Myanmar machine translation

Received Aug 5, 2020 Revised Mar 13, 2021 Accepted Mar 24, 2021

Word reordering remains one of the challenging problems in machine translation between language pairs with different word orders, e.g. English and Myanmar. Without reordering between these languages, a source sentence may be translated directly with a similar word order, and the translation may not be meaningful. Myanmar is a subject-object-verb (SOV) language, and effective reordering is essential for translation. In this paper, we applied a pre-ordering approach using recurrent neural networks to pre-order the words of the source English sentence into the word order of the target Myanmar language. The neural pre-ordering model is derived automatically from parallel word-aligned data, using syntactic and lexical features based on dependency parse trees of the source sentences. It can generate arbitrary permutations, which may be non-local within the sentence, and can be integrated into English-Myanmar machine translation. We used the model to reorder English sentences into Myanmar-like word order as a preprocessing stage for machine translation, obtaining quality improvements compared with a baseline rule-based pre-ordering approach on the Asian Language Treebank (ALT) corpus.


INTRODUCTION
Machine translation (MT) systems must handle both word ordering and word translation in their output. Generating the proper word order of the target language has become one of the main problems for MT. Phrase-based statistical machine translation (PBSMT) systems rely on the target language model and can produce short-distance reordering, but they still have difficulty with the long-distance reordering that arises between language pairs with different word orders. For practical reasons, all PBSMT decoders limit the amount of reordering and are unable to produce correct translations when the required movement spans a large distance. Moreover, differing syntactic structures make it challenging to generate syntactically and semantically correct word order sequences while maintaining both efficiency and translation quality. Hence, the quality of PBSMT can be enhanced by reordering the words of the source sentences into the target word order as a preprocessing phase in MT.
The pre-ordering method has been successfully employed for various language pairs between English and subject-object-verb (SOV) languages such as Japanese, Urdu, Hindi, Turkish, and Korean [1]. The major issue is that it needs a high-quality parser, which may not be available for the source language, and considerable linguistic expertise for writing the reordering rules [2]. Pre-ordering methods can be categorized by whether the rules are hand-coded from linguistic knowledge or learned automatically from data [3], [4]. These approaches have the advantages that they are independent of the actual MT system used, are fast to apply, and tend to decrease the time spent in actual decoding. Linguistically, the syntax-based pre-ordering method requires parse trees, either constituency or dependency parse trees, which are more powerful in capturing the sentence structure.
In this study, we employed a pre-ordering approach using recurrent neural networks (RNN) based on rich dependency syntax features to reorder the words in the source sentence according to the target word order. This approach treats pre-ordering as a sequence prediction machine learning task and produces permutations that may be long-range on the source sentences. The main purpose of this work is to construct an English-to-Myanmar pre-ordering system for developing statistical machine translation (SMT). To our knowledge, no previous work has applied neural network pre-ordering to the English-to-Myanmar language pair. We could not conduct the experiment in the Myanmar-to-English direction because parsers are unavailable for the Myanmar language.
The rest of the paper is organized as follows. Representative work on reordering approaches is reviewed in section 2, and section 3 presents the Myanmar language and reordering issues for English-Myanmar translation. Section 4 describes the recurrent neural network pre-ordering model, section 5 presents details of the experiments, and section 6 discusses the results. Finally, section 7 presents the conclusion of this paper.

LITERATURE REVIEW
Several reordering methods have been proposed to address the reordering problem. Previous reordering work on Myanmar applied rule-based and statistical approaches and made use of rules derived from small sets of parallel sentences. Wai et al. [5] described automatic generation of reordering rules for English-to-Myanmar translation, applying rule extraction algorithms based on part-of-speech (POS) tags and function tags. First-order Markov theory was used to implement a stochastic reordering model. Win [6] presented a clause level reordering (CLR) technique that applies POS rules, an article checking algorithm, and a pattern matching algorithm for Myanmar-to-English reordering. In that study, translated target English words were reordered into the proper order of the target sentences by using 14 test sentence patterns.
An automatic rule learning approach for English-Myanmar pre-ordering was proposed in [7]. The training method extracted pre-ordering rules from dependency parse trees so as to decrease the number of alignment crossings in the training corpus. The results showed that this approach gained an improvement of 1.8 BLEU points on ALT data over the baseline PBSMT. A dependency-based head finalization pre-ordering method for English-Myanmar was studied in [8]. The dependency structures are used to perform head finalization, in which a head word is moved after all its modifiers. In that analysis, the source English sentences were reordered before being passed to neural machine translation (NMT) and statistical machine translation (SMT) systems. Pre-ordering improved translation by 0.2 BLEU in the PBSMT system but brought no improvement in the NMT system.
Reordering methods have received considerable attention from researchers. Earlier works used a source parser and manual reordering rules. Isozaki et al. [9] introduced a simple rule-based pre-ordering system for an English-to-Japanese task with a parser annotating syntactic heads. The rules were created from constituency trees produced by Enju, a head-driven phrase structure grammar parser. Other works have learned reordering rules automatically, applying syntactic features of the parse trees. Genzel [10] and Cai et al. [11] proposed approaches that learn unlexicalized reordering rules from automatically word-aligned data by minimizing the number of crossing alignments. The rules are learned automatically over sequences of POS tags from dependency parse trees and can capture several types of reordering phenomena. Each rule has a vector of features for a vertex with a context of its child nodes and defines a permutation of the child nodes.
Jehl et al. [12] developed a pre-ordering approach using a logistic regression model that predicts whether two child nodes in the source parse tree need to be swapped. This model used a feature-rich representation and a depth-first branch-and-bound search to make reordering predictions: for each node, it searches for the best permutation of the child nodes given the pairwise scores produced by the model. Kawara et al. [13] proposed a pre-ordering method based on a recursive neural network (RvNN) that is free of manual feature design and makes use of information in sub-trees. The method learns whether to reorder nodes from a vector representation of the sub-trees. It achieved better translation quality than the top-down BTG pre-ordering method on English-Japanese translation.
Pre-ordering as a walk on the dependency tree, guided by a recurrent neural network, was introduced in [14]. This method applied a neural reordering model together with beam search over multiple ordering alternatives and was compared with other phrase-based reordering models. Hadiwinoto et al. [15] developed a reordering method exploiting sparse features related to dependency word pairs and dependency-based embeddings to predict whether the translations of two source words connected by a dependency relation should remain in the same order or be swapped in the target sentence. Gispert et al. [16] used feed-forward neural networks to improve both the speed and the accuracy of pre-ordering in MT. The method was applied to various language pairs and gave acceptable results compared with state-of-the-art techniques.

MYANMAR LANGUAGE
The Myanmar language is complex and ambiguous, with rich morphology. Sentences are made up of one or more phrases, such as verb phrases and noun phrases, or clauses. Morphologically, it is an analytic language without inflection of morphemes. Syntactically, it is usually a head-final language in which functional morphemes follow content morphemes and the verb always comes at the end of a sentence [17]. Sentences are delimited by a sentence boundary marker, but phrases and words are rarely delimited with spaces, and there are no guidelines on how to place spaces within sentences. Word segmentation is therefore needed to detect word boundaries in Myanmar text, and the segmentation result affects reordering and translation performance.

Reordering issues in English-Myanmar translation
Myanmar follows SOV or OSV word order and is verb-final: the verb and its inflections follow the arguments (subject, object, and indirect object), adverbials, and postpositional phrases. There are various syntactic differences between the two languages, not only at the word level but also at the phrase level. English is an SVO language and puts auxiliary verbs close to the main verbs. Myanmar word order differs from English mostly in the noun phrase, prepositional phrase, and verb phrase. Myanmar uses postpositions, which mark many grammatical features, whereas English uses prepositions. In addition, one English sentence can correspond to multiple word orders in Myanmar. This is illustrated in Table 1 with the example English sentence "He carries the baggage to the train", for which all the Myanmar translations shown are valid sentences. Without reordering between the languages, the words of an English sentence are translated directly and the translation may not be meaningful. Reordering is therefore one of the key problems for English-Myanmar translation.

PRE-ORDERING MODEL

Dependency parsing
A dependency parse can be defined as a directed graph over the words of a sentence that shows the syntactic relations between the words, as in Figure 1. The parse tree is rooted at the main verb as the head word, with the subject (noun or pronoun) and object as direct modifiers, subordinate verbs and adjectives as modifiers of their own heads, and so on, resulting in hierarchical head-modifier relations. The edges are labeled with the syntactic relationship between a head word and a modifier word [18]. Each sub-tree corresponds to a contiguous substring of the sentence. There are two kinds of commonly used dependency parsers: the Conference on Natural Language Learning (CoNLL) typed dependency parser and the Stanford typed dependency parser [19].

Reordering on a dependency parse tree
Reordering is performed using a syntax-based method that operates on source-side dependency parse trees and is restricted to swaps between nodes that are in parent-child or sibling relations. Let a source sentence be a list of words $f = \{f_1, f_2, \ldots, f_L\}$ annotated with a dependency parse tree. Reordering is performed by traversing the dependency tree, beginning at the root. The process moves from word to word along the edges of the parse tree and, at each step, may emit the current word, subject to the constraint that each word is emitted exactly once. The resulting sequence of emitted words $f'$ is a permutation of the sentence $f$, and any permutation can be produced by a suitable path on the parse tree. This reordering framework performs a sequence of movements on the tree and continues until all the nodes have been emitted. Figure 2 shows a bilingual parallel sentence with word alignments and the equivalent pre-ordered source sentence. The source English sentence is "Three people were killed due to the storm." and Myanmar
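The idea that a traversal of the dependency tree yields a permutation in which every word is emitted exactly once can be illustrated with a minimal sketch. The example does not implement the learned walk described in this section; it assumes a fixed head-final rule (each head is emitted after all of its modifiers, in the spirit of head finalization [8]) and a hand-written parse of the example sentence from the previous section. The function name `head_final_order` and the parse itself are assumptions made only for illustration.

```python
from collections import defaultdict


def head_final_order(heads):
    """Return a permutation of 0-based word indices in which every
    modifier sub-tree is emitted before its head.

    heads[i] is the 0-based index of the head of word i, or -1 for the root.
    Modifiers of the same head keep their source order.
    """
    children = defaultdict(list)
    root = None
    for i, h in enumerate(heads):
        if h == -1:
            root = i
        else:
            children[h].append(i)

    order = []

    def visit(node):
        for child in children[node]:  # emit all modifier sub-trees first
            visit(child)
        order.append(node)            # then emit the head exactly once

    visit(root)
    return order


words = ["He", "carries", "the", "baggage", "to", "the", "train"]
# Assumed, hand-written parse (not parser output): the preposition "to"
# heads its object "train", and "carries" is the root.
heads = [1, -1, 3, 1, 1, 6, 4]

permutation = head_final_order(heads)
reordered = [words[i] for i in permutation]
# reordered == ["He", "the", "baggage", "the", "train", "to", "carries"]
```

The output places the verb last and the preposition after its object, which resembles the verb-final, postpositional order of Myanmar. A learned model replaces the fixed rule with scored decisions at each step.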

Recurrent neural networks pre-ordering model
Recurrent neural networks have been used to model long-range dependencies among words because, unlike feed-forward neural networks, they are not restricted to a fixed context length. They allow easy incorporation of many features and can be trained on sequential data of varying lengths. The pre-ordering system is modeled using syntax-based features and is restricted to reordering between pairs of nodes that are in parent-child or sibling relations. Let $f = \{f_1, f_2, \ldots, f_L\}$ be a source sentence. The state transition of the recurrent neural network is defined as:

$$v(t) = \tanh\left(\theta_1 \cdot x(t) + \theta_2 \cdot v(t-1)\right)$$

where $v(t)$ is the hidden state at step $t$, $x(t) \in \mathbb{R}^n$ is the input feature vector related to the $t$-th word in a permutation $f'$, and $\theta_1$ and $\theta_2$ are parameters. $\tanh(\cdot)$ is the hyperbolic tangent function, and the input features are encoded with "one-hot" encoding. The logistic softmax function is used to normalize the scores $s_{j,t}$ of the candidate words $j$ that have not yet been emitted:

$$P(I_t = j \mid f, i_1, \ldots, i_{t-1}) = \frac{\exp(s_{j,t})}{\sum_{j'} \exp(s_{j',t})}$$

where $I_t$ denotes the index of the source word emitted at step $t$ and the sum runs over the words not yet emitted. The probability of the total permutation $f'$ can be computed by multiplying the probabilities of each word:

$$P(f' \mid f) = \prod_{t=1}^{L} P(I_t = i_t \mid f, i_1, \ldots, i_{t-1})$$

For decoding, the recurrent neural network (RNN) pre-ordering model must compute the most likely permutation $f'$ of the source sentence. This search problem is solved approximately, to a local optimum, using a beam search strategy. The permutation is generated incrementally from left to right, starting from an empty string, and all possible successor states are generated at each step.
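The decoding procedure can be sketched in a few lines of plain Python. The sketch assumes a tanh state update with two parameter matrices, a softmax restricted to the words not yet emitted, and a beam that keeps the highest-scoring prefixes. The text does not specify how the per-word scores are computed from the hidden state, so `score_fn` is a hypothetical stand-in, and the names `rnn_step`, `softmax_over`, and `beam_search` are illustrative rather than taken from the original implementation.

```python
import math


def rnn_step(theta1, theta2, x, v_prev):
    """One state update: tanh(theta1 . x + theta2 . v_prev).

    Matrices are lists of rows; vectors are lists of floats.
    """
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(row1, x))
            + sum(w * vi for w, vi in zip(row2, v_prev))
        )
        for row1, row2 in zip(theta1, theta2)
    ]


def softmax_over(scores, candidates):
    """Normalize raw scores over the words that have not been emitted yet."""
    top = max(scores[j] for j in candidates)  # subtract max for stability
    exps = {j: math.exp(scores[j] - top) for j in candidates}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


def beam_search(n_words, score_fn, beam_size=4):
    """Build a permutation left to right, keeping the best prefixes.

    score_fn(prefix) is a hypothetical callback returning one raw score per
    source word; in a full model it would be derived from the RNN state
    reached after emitting `prefix`.
    """
    beam = [((), 0.0)]  # (emitted indices, log-probability so far)
    for _ in range(n_words):
        expanded = []
        for prefix, logp in beam:
            remaining = [j for j in range(n_words) if j not in prefix]
            probs = softmax_over(score_fn(prefix), remaining)
            for j, p in probs.items():
                expanded.append((prefix + (j,), logp + math.log(p)))
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_size]
    return beam[0]


# Toy scorer (assumption): constant preferences, word 1 > word 2 > word 0.
best_perm, best_logp = beam_search(3, lambda prefix: [0.0, 3.0, 1.0])
# best_perm == (1, 2, 0)
```

The log-probability of a complete hypothesis is the sum of the per-step log-probabilities, which corresponds to the product of per-word probabilities described above.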
Two types of feature structures are used for RNN training: lexicalized and unlexicalized. In the unlexicalized structure, the input feature function is composed as follows. The unigram features are the POS tags and dependency tags of the current word, of its left and right children, and of its dependency parent, together with products of these tags. The bigram features additionally capture the relationship between the current word and the previously emitted word in the permutation. All features are encoded with one-hot encoding. The features of the lexicalized structure just contain the surface form of each word.
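A small sketch may help make the unlexicalized encoding concrete. It assumes that the "product" of two tags is a conjunction feature, i.e. a one-hot indicator for the (POS tag, dependency tag) pair; this reading, the helper names, and the toy tag set are assumptions, not details given in the text.

```python
def build_vocab(values):
    """Map each distinct symbol to a position in a one-hot vector."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, vocab):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec


def unigram_features(pos, dep, pos_vocab, dep_vocab, pair_vocab):
    """Concatenate one-hot POS, one-hot dependency tag, and the one-hot
    conjunction of the two (assumed reading of the tag product)."""
    return (
        one_hot(pos, pos_vocab)
        + one_hot(dep, dep_vocab)
        + one_hot((pos, dep), pair_vocab)
    )


# Toy tags for "He carries the baggage" (hand-written, illustrative only).
tags = [("PRP", "nsubj"), ("VBZ", "root"), ("DT", "det"), ("NN", "dobj")]
pos_vocab = build_vocab(p for p, _ in tags)
dep_vocab = build_vocab(d for _, d in tags)
pair_vocab = build_vocab(tags)

x_he = unigram_features("PRP", "nsubj", pos_vocab, dep_vocab, pair_vocab)
# x_he == [0, 0, 1, 0,  0, 0, 1, 0,  0, 0, 1, 0]
```

Features for the left child, right child, and parent of the current word could be built the same way and concatenated to this vector.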

Corpus statistics
Myanmar is a low-resource language, and large manually aligned parallel corpora are extremely rare. We used the English-Myanmar parallel corpus from the Asian Language Treebank (ALT) project [20] for training and testing the pre-ordering model. The corpus consists of about 20 K parallel sentences from English Wikinews, translated into other languages such as Bengali, Filipino, Hindi, Bahasa Indonesia, Japanese, Lao, Khmer, Malay, Myanmar, Thai, Vietnamese, and Chinese. Most of the sentences in the ALT corpus are very long and complex, and some contain idioms. The corpus is randomly divided into 19,098 sentences for training, 500 sentences for development, and 200 sentences for testing to evaluate the reordering systems, as shown in Table 2. The average English sentence length is about 15 words per sentence.

Training the RNN pre-ordering model
We used the pre-ordering framework of Miceli Barone [14] to permute the word order of English sentences into that of the Myanmar sentences. Manual word alignments from the corpus are used because the quality of automatic alignment tools is still low for the English-Myanmar language pair. However, the manual alignments contain errors and missing links, so we checked them manually to obtain more accurate alignments. English sentences are parsed with the Stanford dependency parser [21], with output in CoNLL format, to obtain syntactic dependency features that represent the grammatical structure of English. The CoNLL format shows one word per line, and each line consists of tab-separated columns with a series of features: index, surface form, lemma, POS tag, head index, and the syntactic relation between the current word and its head word, as shown in Figure 3.
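Reading such a file can be sketched as follows. The sketch assumes exactly the six tab-separated columns listed above; standard CoNLL-X and CoNLL-U files carry additional columns, so a real reader would need to select the relevant ones. The `Token` type, the `parse_conll` function, and the sample lines (hand-written, not parser output) are illustrative assumptions.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    index: int    # 1-based position in the sentence
    form: str     # surface form
    lemma: str
    pos: str      # part-of-speech tag
    head: int     # index of the head word, 0 for the root
    deprel: str   # relation between this word and its head


def parse_conll(block: str) -> List[Token]:
    """Parse one sentence given as six tab-separated columns per line."""
    tokens = []
    for line in block.strip().splitlines():
        if not line.strip():
            continue  # skip blank separator lines
        idx, form, lemma, pos, head, deprel = line.split("\t")
        tokens.append(Token(int(idx), form, lemma, pos, int(head), deprel))
    return tokens


# Hand-written sample for "Three people were killed" (illustrative labels).
sample = "\n".join([
    "1\tThree\tthree\tCD\t2\tnummod",
    "2\tpeople\tpeople\tNNS\t4\tnsubjpass",
    "3\twere\tbe\tVBD\t4\tauxpass",
    "4\tkilled\tkill\tVBN\t0\troot",
])

tokens = parse_conll(sample)
# 0-based head array usable by a tree traversal; -1 marks the root.
heads = [t.head - 1 for t in tokens]
```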
The training instances are obtained from the word-aligned corpus and the dependency tree corpus. Each source word is assigned an integer index equal to the position of the target word that it is aligned to, and unaligned words are attached to the subsequent aligned word. Training aims to learn the parameters that minimize the cross-entropy function. The training setup is as follows. Training uses stochastic gradient descent, with gradients computed by the automatic differentiation of Theano, which implements backpropagation through time. L2 regularization is applied, and learning rates are adapted with AdaDelta. Gradient clipping is used to avoid the exploding gradient problem. We applied the RNN pre-ordering approach with lexicalized (Lex-RNN) and unlexicalized (Unlex-RNN) features.

Figure 3. Dependency features in the CoNLL format
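The construction of a reference permutation from word alignments can be sketched as below. The sketch assumes at most one target position per source word, attaches each unaligned word to the next aligned word on its right, and, as an additional assumption for words with no aligned word to their right, falls back to the last aligned word on their left. The function name and the toy alignment are hypothetical.

```python
def reference_permutation(n_source, alignment):
    """Return source indices sorted by the target position they align to.

    alignment maps a 0-based source index to a 0-based target position.
    Unaligned words inherit the position of the subsequent aligned word;
    trailing unaligned words inherit the last aligned position (assumption).
    """
    positions = [alignment.get(i) for i in range(n_source)]

    # Right-to-left pass: attach unaligned words to the next aligned word.
    nxt = None
    for i in range(n_source - 1, -1, -1):
        if positions[i] is None:
            positions[i] = nxt
        else:
            nxt = positions[i]

    # Left-to-right pass: handle trailing unaligned words.
    last = 0
    for i in range(n_source):
        if positions[i] is None:
            positions[i] = last
        else:
            last = positions[i]

    # sorted() is stable, so ties keep their source order.
    return sorted(range(n_source), key=lambda i: positions[i])


words = ["He", "carries", "the", "baggage"]
# Toy alignment (assumption): He -> 0, baggage -> 1, carries -> 2,
# and the article "the" is left unaligned.
alignment = {0: 0, 1: 2, 3: 1}

perm = reference_permutation(len(words), alignment)
# perm == [0, 2, 3, 1], i.e. "He the baggage carries"
```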

Pre-ordering baseline
For the baseline rule-based pre-ordering, Otedama, an open-source pre-ordering tool [22], was used for comparison. The hyperparameters are set as follows: the maximum number of matching features is 10, the window size is 3, and the maximum waiting time is 30 minutes. Rule learning uses the POS tags and dependency relations of the current node, the sequence of its child nodes, and the parent of the node.

Reordering metrics
Current translation metrics measure word order quality only indirectly, and their capability to capture reordering performance has been shown to be poor. To evaluate the pre-ordering models, we compared the reference permutations derived from manual word alignments with the hypothesis permutations generated by the models. There are different ways of measuring permutation distance. The Hamming distance was presented by Ronald [23] and is known as the exact match permutation distance. It captures the amount of disorder by counting the positions at which two permutations disagree.
Kendall's tau distance [24] is the minimum number of swaps of two adjacent words needed to convert one permutation into another. It is also called the swap distance and can be interpreted in terms of the probability of observing discordant rather than concordant pairs. This metric is attractive for measuring word order because it is sensitive to all relative orderings. We also used a combined metric, the LRscore [25], which was proposed to evaluate reordering performance. The metric combines the distance scores with BLEU to assess the word order of translation output. The combined metric correlates better with human assessment than BLEU alone.
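Both permutation distances can be computed directly from a reference and a hypothesis permutation, as in the sketch below. The raw counts follow the definitions given above; the linear normalization to a score in [0, 1] is an assumption made for illustration, and published variants may normalize differently.

```python
from itertools import combinations


def hamming_disagreements(ref, hyp):
    """Number of positions at which the two permutations differ."""
    return sum(1 for r, h in zip(ref, hyp) if r != h)


def kendall_tau_discordant(ref, hyp):
    """Number of word pairs ordered differently in ref and hyp.

    This equals the minimum number of adjacent swaps needed to turn
    one permutation into the other.
    """
    rank = {w: i for i, w in enumerate(hyp)}
    return sum(1 for a, b in combinations(ref, 2) if rank[a] > rank[b])


def normalized_scores(ref, hyp):
    """Map both distances to [0, 1], where 1 means identical order.

    The linear normalization used here is an illustrative assumption.
    """
    n = len(ref)
    if n < 2:
        return 1.0, 1.0
    pairs = n * (n - 1) // 2
    ham = 1.0 - hamming_disagreements(ref, hyp) / n
    tau = 1.0 - kendall_tau_discordant(ref, hyp) / pairs
    return ham, tau


ref = [0, 2, 3, 1]   # reference order, e.g. "He the baggage carries"
hyp = [0, 1, 2, 3]   # hypothesis that leaves the source order unchanged

ham_count = hamming_disagreements(ref, hyp)   # 3 positions differ
tau_count = kendall_tau_discordant(ref, hyp)  # 2 discordant pairs
```

In this example the Hamming count penalizes three positions even though only one word (the verb) is misplaced, whereas the Kendall's tau count reflects the two pairwise inversions that the misplaced verb causes.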

RESULTS AND DISCUSSION
The proposed pre-ordering model was tested mainly on English-to-Myanmar pre-ordering. Table 3 shows the results of the pre-ordering experiments in terms of monolingual bilingual evaluation understudy (BLEU), Hamming distance, Kendall's tau distance, the LRscore combining 4-gram BLEU with Kendall's tau distance (LR-KB4), and the LRscore combining 4-gram BLEU with Hamming distance (LR-HB4), computed by comparing the reference reordering with the reordered output sentences for the two RNN pre-ordering methods and the rule-based pre-ordering method. The results showed that the two RNN pre-ordering methods score higher than the rule-based pre-ordering, and Unlex-RNN achieves an improvement of 0.54 BLEU over the rule-based method. Unexpectedly, the results showed that Unlex-RNN performed worse than Lex-RNN. This runs contrary to expectation, as neural models commonly use lexicalized features but often also use unlexicalized features. We also observed that the accuracy of dependency parsing and word alignment affects the training of the pre-ordering model.
We also compared the pre-ordered sentences using the crossing score [26], which counts the number of crossing links in the aligned sentence pairs. Ideally this score would be zero, indicating a monotonic parallel corpus. The more monotonic the corpus, the better the translation models will be and the faster decoding will be, as less distortion is needed. Table 4 shows the percentage of the crossing score that remains after applying the pre-ordering models; lower is better. Unlex-RNN and Otedama achieved crossing score reductions of about 30 percent compared to the original crossing score (baseline). We also found that the neural pre-ordering method outperforms Otedama, achieving a reduction of nearly 5 percentage points. In the proposed RNN pre-ordering model, pre-translation reordering is carried out on the source language side.
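The crossing score can be illustrated with a short sketch that counts crossing pairs of alignment links before and after pre-ordering. The link representation as (source position, target position) pairs, the helper names, and the toy sentence are assumptions for illustration only.

```python
def crossing_score(links):
    """Count pairs of links (s1, t1), (s2, t2) with s1 < s2 and t1 > t2."""
    links = sorted(links)
    return sum(
        1
        for a in range(len(links))
        for b in range(a + 1, len(links))
        if links[a][0] < links[b][0] and links[a][1] > links[b][1]
    )


def apply_permutation(links, perm):
    """Re-index the source side of each link after pre-ordering.

    perm[k] is the original index of the word placed at new position k.
    """
    new_pos = {old: new for new, old in enumerate(perm)}
    return [(new_pos[s], t) for s, t in links]


# Toy links for "He carries the baggage" (assumption):
# He -> 0, carries -> 2, baggage -> 1; "the" is unaligned.
links = [(0, 0), (1, 2), (3, 1)]
perm = [0, 2, 3, 1]  # pre-ordered as "He the baggage carries"

before = crossing_score(links)                          # 1 crossing
after = crossing_score(apply_permutation(links, perm))  # 0 crossings
```

A score of zero after pre-ordering corresponds to a monotonic alignment for this sentence pair.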
Figure 4 illustrates pre-ordering examples produced by the different pre-ordering approaches: Unlex-RNN pre-ordering and rule-based pre-ordering. The training data consists mostly of long sentences, and longer sentences tend to contain more reordering errors than shorter ones. Although the pre-ordering results for the examples do not match the references exactly, their crossing scores are reduced relative to the baseline.

CONCLUSION
To address the reordering problem in the English-Myanmar language pair, a recurrent neural network pre-ordering model is introduced. The model is trained with a machine learning approach and can generate arbitrary permutations without manual reordering rules. This study is the first work applying RNN architectures to English-Myanmar pre-ordering. The approach combines the strengths of lexical and syntactic reordering, using the structure of dependency parse trees and rich features. We exploited the effectiveness of neural networks and compared neural and rule-based pre-ordering on reordered source sentences. Experimental results showed that the performance of neural networks is quite promising for English-to-Myanmar pre-ordering. Currently, our experiments focus primarily on English pre-ordering for English-Myanmar SMT. Because this work depends on correct parsing and good word alignments, better reordering results can be expected in the future with better word alignments, more accurate parsing, and more training data. In future work, we will incorporate the pre-ordering model into statistical machine translation.