APMorph: finite-state transducer for Amazigh pronominal morphology

Received Dec 10, 2019; Revised Jul 14, 2020; Accepted Jul 31, 2020

This work presents an Amazigh pronominal morphological analyzer (APMorph) based on the Xerox finite-state tool (XFST). The system is built around a large lexicon named “APlex”, which contains the pronouns affixed to nouns and to verbs together with the characteristics of each lemma. A set of rules is added to define the inflectional behavior and morphosyntactic links of each entry, as well as the relationships between the different lexical units. The implementation and evaluation of our approach are detailed in this article. XFST is a relevant choice because the platform supports both analysis and generation. The robustness of our system allows it to be integrated into other natural language processing (NLP) applications, especially spellchecking, machine translation, and machine learning. This paper continues our previous work on the automatic processing of Amazigh nouns and verbs.


INTRODUCTION
The Amazigh language, or Tamazight, is considered part of Moroccan culture [1,2]. In the past, Tamazight was not regarded as an important language in Morocco. Today, a political decision is required for the standardization and protection of the Amazigh language.
On June 17, 2011, the Kingdom of Morocco announced a constitutional reform under which Tamazight becomes an official language of the nation, to be used in all state administrations. With the creation of the Royal Institute of Amazigh Culture (IRCAM), the Amazigh language has developed its own spelling [3] and has been encoded in the Unicode standard [4][5][6]. As a result, the Amazigh language was introduced into the public domain, particularly in education, administration, and media. However, all these steps remain insufficient compared with the advancement of the Amazigh language in the political, cultural, and social fields. This situation calls for further efforts in automatic language processing, especially morphological analysis.
Since the eighties, morphological analysis has been a priority step in language processing. Over the years, several approaches have been implemented, including rule-based methods [7] and statistical machine learning methods [8,9]. The choice of a morphological analysis method for a given language depends mainly on the availability of resources. Therefore, in the current paper, given the shortage of linguistic resources for modeling Amazigh morphology with machine learning methods, we opted for a rule-based approach. Our morphological processor is based on the XFST tools. We chose this platform because it has proved efficient for numerous languages, including several European languages [10][11][12], some South American languages [13], and some Asian languages such as Arabic [14]. The strength of XFST technology is its ability to process and manage the characteristics of each language, whether inflectional, derivational, concatenative, nonconcatenative, or agglutinative. For the Amazigh language, the main work done at various institutes in Morocco is based on finite-state theory and uses two platforms, NooJ and XFST:
- For the NooJ platform, the authors worked on the simultaneous analysis of nouns, verbs, and autonomous pronouns through a single morphological analyzer [15].
- For the XFST platform, the researchers opted for an individual morphological analysis of two lexical units, the noun and the verb [16][17][18].
The present work follows our previous articles on the morphological analysis of the Amazigh language, which covered two lexical units: nouns [17] and verbs [18]. These two works are an important step and a prerequisite towards a general morphological analyzer capable of processing all Amazigh lexical units. In the same perspective, the present work is a new step in lexical morphological analysis: it builds a morphological analyzer for Amazigh affixed pronouns. The special feature of our system (APMorph) is its ability to analyze affixed lexical units, unlike the existing analyzer, which handles only autonomous units.
The rest of this article is organized in five parts. Part 1 gives an overview of the history of the Amazigh language. Part 2 details the morphology of affixed pronouns. Part 3 gives a brief overview of the XFST toolkit. Part 4 presents our system together with the experimental results and their analysis. The last part gives the conclusions and discusses the reliability of our approach.

MOROCCAN AMAZIGH LANGUAGE

Historical overview
The Amazigh language, or Tamazight, is one of the languages spoken by the Amazigh people in North Africa and belongs to the Afro-Asiatic (Hamito-Semitic) family [19]. It is spoken in Morocco and in many other communities in parts of Niger, Mali, and Burkina Faso. To enhance and promote the Moroccan Amazigh language, IRCAM was created on 17 October 2001. Subsequently, Tamazight was integrated into the administrative, cultural, and social sectors, with an official alphabet conforming to the Unicode Consortium standard [20].
Linguistically, the Amazigh language comprises three dialects spread over three geographical areas: Tarifite in the north, Tamazight in the center, and Tachlhite in the south of the country. This linguistic diversity has required IRCAM to initiate a standardization process [21], which consists of several steps: adopting a common writing system and a common basic lexicon; applying the same orthographic rules, the same instructional guidelines, and the same neological forms; and exploiting dialectal variation in order to safeguard the richness of the language.

Amazigh language morphology
With its non-concatenative inflectional morphology, the structure of the Amazigh language is considered similar to that of the Semitic languages. A word is formed by merging a root and a pattern, a process called interdigitation. The root is defined as a sequence of consonants, and the pattern as a group of vowels and consonants combined with suffixation or prefixation.
The morphological analysis of a language is a preliminary step that precedes all other automatic language-processing modules. It consists in determining the grammatical structure of a given word from its morphemes. The main lexical units in this language are nouns, verbs, pronouns, and function words, which include adverbs and prepositions [23]. In this work, we are interested in the pronoun category, specifically the affixed personal pronouns.

Affixed personal pronouns
In the Amazigh language, a pronoun is a word that can replace a noun or a nominal group, or that can be attached to a verb. It stands for a nominal group already mentioned, or designates a person involved in the communication. There are two categories of pronouns: autonomous (independent) personal pronouns and affixed personal pronouns. The latter category, which attaches to the verb, noun, or preposition, is the one treated in this work.
a. Pronouns affixed to the verb
The verb may take a personal pronoun as a direct or indirect object complement. The affixed pronouns (direct and indirect forms) are placed after the verb in an affirmative sentence. They depend on gender (masculine, feminine), number (singular, plural), and person (first, second, third). This dependency generates twenty possible cases, as illustrated in Table 1. For example, the direct form in the third person singular is ⵥⵕⵉⵖ ⵜ [zrikht] "I saw it" (masculine) or ⵥⵕⵉⵖ ⵜⵜ [zrikhtt] "I saw it" (feminine).
b. Pronouns affixed to the noun
These pronouns are always placed after the noun (simple, compound, or derived form) and agree with the possessor, not with the possessed object. A distinction is made between pronouns affixed to ordinary nouns and pronouns affixed to kinship nouns. There are twenty classes for both forms, which also depend on gender (masculine, feminine), number (singular, plural), and person (first, second, third). For example, ⴰⵎⵙⴰⵡⴰⴹⵏⵏⵖ [amsawadnngh] "our communication" (masculine/feminine, first person plural). Table 2 summarizes all possible cases.

OVERVIEW OF FINITE-STATE TRANSDUCER (FST)
A finite-state transducer (FST) is an enhanced finite-state machine: a type of finite-state automaton that maps between two sets of symbols. It can accept or reject a string and transform one string into another, as shown in Figure 1. The FST establishes a relation between two formal languages: a lexical level and a surface level [24]. For example, ⵥⵕⵉⵖ ⵜ [zrikht] at the surface level becomes ⵥⵕⵉⵖ+verb+Sg+masc+affix_pronoun_verb (personal pronoun affixed to the verb, masculine singular) at the lexical level, which means that the system recognizes the word as carrying an affixed pronoun, as shown in Table 3. The "+" sign preceding each tag is added by convention (see Section 3).
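The relation between the two levels can be sketched with a very small transducer in Python. This is only an illustration under stated assumptions, not the XFST network itself: the class `ToyFST`, its method names, the state numbers, and the way the tags are aligned with the surface symbols are all hypothetical; only the lexical/surface pair is the example given above.

```python
EPS = ""  # epsilon: this side of the arc consumes or emits nothing


class ToyFST:
    """Minimal finite-state transducer over multi-character symbols (illustrative)."""

    def __init__(self, start, finals, arcs):
        # Each arc is (source_state, lexical_symbol, surface_symbol, target_state).
        self.start = start
        self.finals = set(finals)
        self.arcs = arcs

    def _apply(self, symbols, in_side):
        """Read `symbols` on one side (0 = lexical, 1 = surface) and return
        every string the network pairs with it on the other side.
        Assumes the network has no epsilon cycles."""
        out_side = 1 - in_side
        results = set()
        stack = [(self.start, 0, ())]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(symbols) and state in self.finals:
                results.add(out)
            for src, lexical, surface, dst in self.arcs:
                if src != state:
                    continue
                pair = (lexical, surface)
                read, emit = pair[in_side], pair[out_side]
                new_out = out + (emit,) if emit != EPS else out
                if read == EPS:
                    stack.append((dst, pos, new_out))
                elif pos < len(symbols) and symbols[pos] == read:
                    stack.append((dst, pos + 1, new_out))
        return sorted("".join(r) for r in results)

    def analyze(self, surface_symbols):
        """Surface side to lexical side."""
        return self._apply(surface_symbols, in_side=1)

    def generate(self, lexical_symbols):
        """Lexical side to surface side."""
        return self._apply(lexical_symbols, in_side=0)


# Hypothetical alignment of the example: tags map to epsilon on the surface,
# and the last tag is realized as the affixed pronoun.
ARCS = [
    (0, "ⵥⵕⵉⵖ", "ⵥⵕⵉⵖ", 1),
    (1, "+verb", EPS, 2),
    (2, "+Sg", EPS, 3),
    (3, "+masc", EPS, 4),
    (4, "+affix_pronoun_verb", " ⵜ", 5),
]
fst = ToyFST(start=0, finals=[5], arcs=ARCS)

analysis = fst.analyze(["ⵥⵕⵉⵖ", " ⵜ"])
surface = fst.generate(["ⵥⵕⵉⵖ", "+verb", "+Sg", "+masc", "+affix_pronoun_verb"])
unknown = fst.analyze(["ⵥⵕⵉⵖ"])  # no path reaches a final state
```

The same arcs serve both directions, which is the property that makes a single network usable as an analyzer and as a generator.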

XEROX FINITE-STATE TOOL (XFST)
XFST is one of the most sophisticated tools for building applications based on finite-state automata. It was developed at the Xerox Research Centre Europe (XRCE) by Kenneth R. Beesley and Lauri Karttunen [25]. It is based on a solid and innovative finite-state technology designed for versatile use, ranging from the segmentation of written text into words and syllables to text generation, by way of morphological analysis and surface syntactic analysis. XFST integrates a set of tools, including:

LEXC grammar
The lexc grammar is also called the lexicon compiler. It is practical for creating two-level lexicons by structuring the morphotactic rules, managing phonological alternations, and handling large irregularities. It is optimized to process efficiently the tens of thousands of base forms generally encountered in natural languages by manipulating large lexical unions [26]. The structure of a lexc grammar is illustrated in Figure 2.
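The way a lexc grammar chains sub-lexicons through continuation classes can be mimicked in Python. The sketch below is not lexc source and not the APlex lexicon: the sub-lexicon name `DirectPronoun`, the `+fem` tag (written by analogy with `+masc`), and the function `expand` are assumptions made for illustration. It relies only on two lexc conventions: the initial sub-lexicon is called `Root`, and `#` is the continuation that ends a word.

```python
END = "#"  # lexc convention: "#" is the continuation class that ends a word

# Each entry is (lexical_part, surface_part, continuation_class).
# Entries and names are illustrative, built from the verb example in this paper.
LEXICONS = {
    "Root": [
        ("ⵥⵕⵉⵖ+verb", "ⵥⵕⵉⵖ", "DirectPronoun"),
    ],
    "DirectPronoun": [
        ("+Sg+masc+affix_pronoun_verb", " ⵜ", END),
        ("+Sg+fem+affix_pronoun_verb", " ⵜⵜ", END),
    ],
}


def expand(lexicons, name="Root", lexical="", surface=""):
    """Follow continuation classes from `name` and yield every
    (lexical string, surface string) pair the lexicon licenses."""
    if name == END:
        yield lexical, surface
        return
    for upper, lower, continuation in lexicons[name]:
        yield from expand(lexicons, continuation, lexical + upper, surface + lower)


pairs = list(expand(LEXICONS))
for lexical, surface in pairs:
    print(f"{lexical}  <->  {surface}")
```

Because each stem entry points to a shared continuation class, adding one stem to `Root` makes every pronoun in `DirectPronoun` available for it, which is how a compact source file can expand into a large number of word forms.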

XFST interface
Xfst provides an interactive interface for creating and manipulating finite-state machines. It can read finite-state networks from binary files and compile them from text files, regular expressions, and substitution rules [27].

AMAZIGH PRONOMINAL MORPHOLOGICAL ANALYZER (APMORPH)

Overall design and implementation
Our aim in this work is to implement an intelligent system capable of analyzing Amazigh pronouns at two levels of abstraction: analysis and generation. Our analyzer takes as input the lexicon of lemmas, APlex, together with morphotactic rules that define the constraints on possible morpheme combinations and the inflection of each word, in order to map each pronoun affixed to a verb or noun to its surface form, as illustrated in Figure 3.

Morphotactic: syntax of morpheme
Morphotactics, or the syntax of morphemes, concerns the restrictions on the ordering of morphemes. In other words, it is "the set of rules that define how morphemes (morpho) can touch (tactics) each other". Three kinds of categories are required for each morphological analyzer:
- Morphological categories: the three lexical categories used (noun, verb, and pronoun). Table 4 lists the three categories with their keywords as declared in lexc.
- Grammatical categories: a grammatical category, or grammatical feature, is a property of items within the grammar of a language. In our case, three categories are treated: number (sgl, plr), gender (fem, masc), and person (1pers, 2pers, 3pers), as shown in Table 5 and declared in lexc.
- Affixed pronoun inflection (class): the behavior of the pronoun with respect to the morphological and grammatical categories, for example "affixed pronoun to the verb in direct form in Msc_sgl" or "affixed pronoun to the ordinary noun in Msc_plr", as illustrated in Table 6.

APlex: source lexicon
The source lexicon for the pronouns affixed to the verb and the noun is defined in xfst notation in a text file named lexicon.txt. Our system operates on a total of 12,000 words for the nominal category and 16,000 for the verbal category. The total number of words is obtained by multiplying the number of classes (inflections) by the number of entries of the given lexical unit, as illustrated in Table 7.
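The size computation described above can be written out as a short check in Python. The two totals (12,000 and 16,000) come from the text; the figure of twenty classes per category is an assumption taken from the description of the affixed pronouns, and the lemma counts derived from it are therefore only implied values, not figures reported in this paper. The function names are hypothetical.

```python
NOMINAL_FORMS = 12_000   # reported in the text
VERBAL_FORMS = 16_000    # reported in the text
ASSUMED_CLASSES = 20     # assumption: twenty classes per category


def expanded_forms(lemma_count, class_count):
    """Number of word forms: one form per (lemma, class) combination."""
    return lemma_count * class_count


def implied_lemmas(form_count, class_count):
    """Invert the product; fail loudly if the assumed class count does not divide evenly."""
    if form_count % class_count != 0:
        raise ValueError("form count is not a multiple of the class count")
    return form_count // class_count


total_forms = NOMINAL_FORMS + VERBAL_FORMS
nominal_lemmas = implied_lemmas(NOMINAL_FORMS, ASSUMED_CLASSES)  # implied, not reported
verbal_lemmas = implied_lemmas(VERBAL_FORMS, ASSUMED_CLASSES)    # implied, not reported

print(total_forms, nominal_lemmas, verbal_lemmas)
```

The sum of the two categories gives the overall number of word forms on which the system operates.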

Rules component
The rules are organized in two sequences. The first sequence contains the morphotactic rules, which act on internal changes and handle the aspectual inflections. These rules depend on the class, and each class concerns a specific phenomenon related mainly to the nature of the word (verb or noun) and to the Amazigh aspect. Combining the first sequence with the lexicon transducer yields a transducer that constrains morphemes so that they combine correctly and in the right order.

Surface input/output
The finite-state transducer for Amazigh pronominal morphology supports both directions: analysis (upper side) and generation (lower side). Analysis takes lexical units as input. When it succeeds (the word is known in the network), the transducer returns the lemma, morphological category, grammatical category, and class, as shown in Table 8. In the other direction, on the same network, generation produces the outputs from the information attached to each lexical unit.
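To make the structure of an analysis concrete, the following Python sketch splits an analysis string into the four pieces of information listed above. It is a post-processing illustration, not part of XFST: the function `parse_analysis` and the two tag sets are assumptions (the `Pl` tag and the person tags are written by analogy with the tags that appear in this paper), and any tag outside those sets is simply treated as the class.

```python
# Assumed tag inventories; only "verb", "Sg", "masc" and "affix_pronoun_verb"
# appear in the example analysis given in this paper.
MORPHOLOGICAL_TAGS = {"noun", "verb", "pronoun"}
GRAMMATICAL_TAGS = {"Sg", "Pl", "masc", "fem", "1pers", "2pers", "3pers"}


def parse_analysis(analysis):
    """Split 'lemma+tag+tag+...' into lemma, morphological category,
    grammatical categories, and class (every remaining tag)."""
    lemma, *tags = analysis.split("+")
    parsed = {"lemma": lemma, "morphological": [], "grammatical": [], "class": []}
    for tag in tags:
        if tag in MORPHOLOGICAL_TAGS:
            parsed["morphological"].append(tag)
        elif tag in GRAMMATICAL_TAGS:
            parsed["grammatical"].append(tag)
        else:
            parsed["class"].append(tag)
    return parsed


result = parse_analysis("ⵥⵕⵉⵖ+verb+Sg+masc+affix_pronoun_verb")
print(result)
```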

Experimental result
Our transducer has been tested in both directions (surface input and lexical output). Its latency and its recognition rate on the analyzed words make the system robust and useful. Applied to the input words, the system recognized 24,440 of 28,000 words, a recognition rate of 87%, as shown in Table 9. Unrecognized words are due to the dialectal diversity of the language and to the fact that some words do not follow all the rules. They can be divided into two categories:

- Some proper nouns, which are not supported by our system.
- Words borrowed from Arabic, for example ⵍⴽⴰⵕⴰⴰ [lkaraa] "bottle".
Most unrecognized forms are due to the standardization process, which linguists have not yet completed and which does not currently cover all categories of words. To allow further evaluation of the system, the set of affixed pronouns related to unstandardized lemmas was therefore sent for normalization. Once this is done, the discarded list will be reinjected into our system for a new analysis and an update of the results.
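The reported recognition rate can be reproduced from the two counts given in this section. The short Python check below uses only those two figures; the function name is hypothetical.

```python
RECOGNIZED = 24_440  # words recognized by the system
TOTAL = 28_000       # words submitted to the system


def recognition_rate(recognized, total):
    """Fraction of input words for which the transducer returns an analysis."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recognized / total


rate = recognition_rate(RECOGNIZED, TOTAL)
unrecognized = TOTAL - RECOGNIZED

print(f"recognition rate: {rate:.2%}")
print(f"unrecognized words: {unrecognized}")
```

The exact ratio is slightly above the rounded 87% quoted in the text, and the remainder corresponds to the proper nouns, borrowed words, and unstandardized forms discussed above.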

CONCLUSION
In this work, our goal was to build a system for the morphological analysis of Amazigh affixed pronouns, implemented with a rule-based approach using the Xerox finite-state tool (XFST). Our system was able to recognize the majority of the pronouns affixed to verbs and nouns (24,440 of 28,000 words), which supports the validity of our approach. Most rejected words are due to the standardization process, which remains incomplete despite the efforts of linguists. In future work, we plan to extend and improve the system so that it covers other lexical units and addresses the problems caused by the dialectal variation of the language.