A Context-based Numeral Reading Technique for Text to Speech Systems

ABSTRACT


INTRODUCTION
The goal of speech synthesis is to develop a machine with an intelligible, natural-sounding voice that conveys information to users in a desired voice, language and accent [1], [2]. Speech synthesis research is multi-disciplinary, spanning acoustic phonetics (speech production and perception) [3], morphology (pronunciation) [4], syntax (parts of speech, grammar) [5] and speech signal processing (synthesis) [6]. Recent research in speech and language processing enables machines to speak naturally, like humans [7]. A Text-to-Speech (TTS) system converts natural language text into its corresponding speech [8]. Intelligible speech synthesis systems have a wide range of applications in human-computer interactive systems [9], such as talking computer systems [10] and talking toys [11]. Speech synthesis, combined with speech recognition, allows interaction with mobile devices through natural language processing interfaces [12]. Analyzing the input text and converting it into a computer-readable form to obtain the appropriate pronunciation plays an important role in producing appropriate speech units that listeners can understand [13]. Text analysis is the front-end language processor of the TTS system [14]; it accepts input text, analyzes it and organizes it into a manageable list of words [15].
An input text may contain symbols (double quote, comma, report, etc.), numbers, abbreviations or special symbols [16]. Text normalization transforms the raw input text into its equivalent in written words [17]. It also involves converting all letters to lowercase or uppercase, removing punctuation, accent marks and stop words or overly common words, and resolving variant forms (such as "Don't" vs. "Do not", "I'm" vs. "I am", "Can't" vs. "cannot", etc.). A sentence is a group of word segments, each of which may be an acronym, a single word or a numeral [18]. While a one-time normalized pronunciation may be maintained for abbreviations or acronyms, the pronunciation of a numeral varies depending on the context of its use in the sentence or word [19].
A number may be pronounced differently in different situations and needs to be converted into its appropriate pronounceable form to produce the desired speech output [18]. Table 1 shows some example pronunciations of English numerals in different situations. In this respect, most foreign languages such as English are well researched [19], and their pronunciation rules are simpler because pronunciations repeat after 20 (e.g., twenty one, twenty two, …, thirty one, thirty two, … etc.). However, Indian language TTS techniques still present a gap in user acceptance due to the unavailability of appropriate pronunciation rules. The probability of pronunciation repetition at word level is relatively very low in Indian languages (e.g., the pronunciations of numbers in Hindi: 21-"ik-kis", 22-"baa-is", 23-"tei-s", etc.). This increases the complexity of the numeral reading module. Therefore, most researchers use simple digit-based reading models that store recorded units for the single digits 0 to 9 to produce the output speech, and do not address context-based numeral reading. However, context-based numeral reading plays an important role in enhancing the understandability of the produced speech. It is always easier to understand the price of an item if it is pronounced using position-based reading, such as "fifty five thousand five hundred", instead of "five five five zero zero". The focus of this paper is context-based numeral pronunciation in the Indian language scenario. Only a few models are documented for speech synthesis in Indian languages [20]-[24], and context-dependent numeral pronunciation has not been well addressed [25]-[28].
The dhvani TTS system for Indian languages [25] maintains the pronunciations of numerals up to hundred as phonetic representations and attaches the position pronunciations for 'hundred', 'thousand', etc. to the "up to hundred pronunciation" for reading the numerals. However, context-dependent numeral reading is not considered in its speech production. A rule-based numeral reading method is presented in [18] for the Odia language.
In this paper, we present a pronunciation rule-based approach for the up to hundred pronunciations and incorporate it with the waveform concatenation technique (WCT) [29] to produce output speech for Indian language numerals. The context-dependent aspects of numeral pronunciation are also considered, so as to produce natural speech segments and increase understandability. A set of experiments is performed to evaluate the performance of the proposed model against the existing syllable-based technique in different contexts of numeral pronunciation. The results obtained show the effectiveness of the proposed technique compared to the existing technique in different contexts. The remainder of the paper is organized as follows. In the next section, we discuss the waveform concatenation technique, as the proposed numeral reading module is incorporated into this rule-based concatenative approach. Section 3 describes the proposed model and the context-dependent numeral pronunciation rules. The experimental methodology and result analysis for our technique are given in Section 4, showing the effectiveness of this technique in producing intelligible speech. Section 5 concludes the discussion, explaining the findings of our experiments and the directions in which further work may be undertaken.

WAVEFORM CONCATENATION TECHNIQUE (WCT)
Compared to English, most Indian languages have approximately twice as many vowels and consonants, along with a number of possible conjunct characters formed by combining two or more characters [28]. Therefore, a large number of speech units must be stored in the speech database when a concatenative speech synthesis technique is used to produce uninterrupted speech. However, WCT [29] stores only 35 basic speech units of the consonant (C) and vowel (V) sounds instead of all required speech units, and derives all other units using rule-based waveform concatenation. The 35 basic speech units are listed in Table 2. To produce the output speech for the required speech segments, a fraction-based waveform concatenation technique is used. The fraction durations are determined dynamically from the speech data based on the vowel onset point identification technique [29]. These fraction durations are used in the waveform concatenation process to obtain the desired speech units. While the rule-based concatenative technique (RCT) [28] uses a static fraction duration for concatenation, the use of dynamic fraction durations in WCT [29] enhances the quality of the speech produced. The fraction-based concatenation process is used for dependent unit pairs, such as Consonants attached to Matra/Fala/Halant/Consonants, and the whole wave data is used for producing independent unit pairs, such as Consonants attached to Consonants/Vowels and Vowels attached to Consonants/Vowels. Figure 1 shows the portion-based waveform concatenation process that produces the sound "\re" from "\ra" and "\ae" using WCT.
Table 2. Speech units in Database
Figure 1. Wave pattern of "/re" (C-M) sound after concatenating portions from /ra and /ae sound
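The fraction-based concatenation described above can be illustrated with a minimal sketch. It assumes that the leading fraction of the first unit and the trailing fraction of the second unit are kept, which is not spelled out here; the function name concatenate_fractions and both fraction parameters are hypothetical, and in WCT the fractions would come from vowel onset point detection [29] rather than being passed in by hand.

```python
def concatenate_fractions(first, second, head_fraction, tail_fraction):
    """Join the leading part of `first` with the trailing part of `second`.

    `first` and `second` are sequences of audio samples. The fractions are
    assumed inputs (values in [0, 1]); this sketch does not detect the
    vowel onset point itself.
    """
    if not (0.0 <= head_fraction <= 1.0 and 0.0 <= tail_fraction <= 1.0):
        raise ValueError("fractions must lie between 0 and 1")
    head_len = int(len(first) * head_fraction)   # samples kept from the start of `first`
    tail_len = int(len(second) * tail_fraction)  # samples kept from the end of `second`
    return list(first[:head_len]) + list(second[len(second) - tail_len:])
```

For independent unit pairs, where the whole wave data is used, the same function applies with both fractions set to 1.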

PROPOSED MODEL
In this section, we present a pronunciation rule-based technique for producing speech segments for Indian language numerals by identifying phoneme-level similarities in the numeral pronunciations of the three considered languages (Odia, Hindi, and Bengali). Figure 2 shows an overview of the proposed numeral reading module; its phases are discussed in the following subsections. First, a context identification process is performed to identify the context of the numeral pronunciation.

Context Dependent Numeral Pronunciation
Context-dependent numeral pronunciation is important for producing meaningful speech samples for numerals. Simple digit reading may not provide the desired understandability in all situations. For example, reading a large quantity or price such as 1,54,954 digit by digit as "one-five-four-nine-five-four" forces the listener to rearrange the numbers to understand the spoken price or quantity, whereas the appropriate pronunciation "one-lakh, fifty four-thousand, nine-hundred, fifty-four" is easier to follow. Similar variation in pronunciation extends to the Indian languages. Table 3 shows some example numerals and their pronunciations in different contexts in English and Odia. A number in different Indian languages may be pronounced by simply reading the digits when it is meant for a quantity [21], phone number, credit card number, etc.; the number may be read according to the positions of its digits when it is meant for a price or a year. For a fractional value, the part before the decimal point is read according to digit positions, and the digits after the period are read as single digits. A date in triplet format (dd-mm-yyyy or dd/mm/yyyy) is read group by group, for example "aek-tin-dui-hajaar-sohala" for the date "01-03-2016". Also, for a time value separated by a colon, the reading format differs for the number before the colon and the number after it. To incorporate all the considered variations of numeral pronunciation, a set of manually coded rules is prepared. The context identification rules are presented below, where n is the number of digits in the number and di is the i-th digit in the number.
Context dependent pronunciation rules:
Rule 1: IF n >= 10 AND no separation in between THEN perform digit reading
Rule 2: IF n >= 10 AND di separated by "," THEN perform position based digit reading
Rule 3: IF digits separated by "-" or "/" in a triplet format THEN perform date format digit reading
Rule 4: IF digits separated by "-" THEN perform digit reading
Rule 5: IF digits separated by ":" THEN perform time format digit reading (digit reading for digits before ":" and position based digit reading for digits after ":")
Rule 6: IF digits separated by "." THEN perform position based digit reading for digits before "." and digit reading for digits after "."
Rule 7:
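As an illustration only, Rules 1 to 6 can be sketched as a small classifier over a numeral token. The function name identify_context, the returned labels and the regular expressions are assumptions made for this sketch; in particular, the date pattern assumes the dd-mm-yyyy layout mentioned earlier, and Rule 3 is checked before Rule 4 because both involve "-". The n >= 10 threshold of Rules 1 and 2 is applied literally, so tokens that match none of Rules 1 to 6 are reported as "unspecified".

```python
import re

def identify_context(token: str) -> str:
    """Map a numeral token to a reading mode following Rules 1-6 (sketch)."""
    n = len(re.sub(r"\D", "", token))  # n: number of digits in the token

    # Rule 3: three groups separated by the same "-" or "/" (dd-mm-yyyy assumed).
    if re.fullmatch(r"\d{1,2}([-/])\d{1,2}\1\d{4}", token):
        return "date"
    # Rule 5: groups separated by ":".
    if re.fullmatch(r"\d+(:\d+)+", token):
        return "time"
    # Rule 6: digits separated by a single ".".
    if re.fullmatch(r"\d+\.\d+", token):
        return "decimal"
    # Rule 4: any other "-" separated digits are read digit by digit.
    if re.fullmatch(r"\d+(-\d+)+", token):
        return "digit"
    # Rule 2: at least ten digits, separated by ",".
    if n >= 10 and re.fullmatch(r"\d+(,\d+)+", token):
        return "position"
    # Rule 1: at least ten digits with no separator.
    if n >= 10 and re.fullmatch(r"\d+", token):
        return "digit"
    return "unspecified"
```

For instance, identify_context("01-03-2016") yields "date", while identify_context("9876543210") yields "digit".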

Pronunciation Rules
As the proposed speech synthesis technique stores only some basic speech units and produces all sounds from these basic units according to specified rules, the pronounceable units for numerals need to be identified and mapped to the respective character equivalents of the sounds to produce the desired output speech. There is no generalized rule available for the pronunciation of numbers up to 100. However, for numbers greater than 100, the pronunciation repeats (e.g., 122: "ek sou-baais", 123: "ek sou-teis", etc. in Hindi). Therefore, the numerals after 100 may be formed by concatenating the 100th, 1000th, … etc. place pronunciations with their respective up to 100 pronunciations. We prepare a set of pronunciation rules for obtaining the up to 100 pronunciations. The pronounceable unit identification process is discussed below.
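The composition of larger numerals from place pronunciations and up to 100 pronunciations can be sketched as follows. This is a simplified illustration, not the implementation used in this work: the names split_by_place and read_by_position are hypothetical, the up to 100 pronunciations are supplied by the caller as a lookup table, and the sketch assumes numbers below 10^9 so that every count stays below 100.

```python
# Place values in the Indian numbering system, largest first.
PLACES = [(10_000_000, "crore"), (100_000, "lakh"), (1_000, "thousand"), (100, "hundred")]

def split_by_place(number: int) -> list[tuple[int, str]]:
    """Split a number into (count, place) pairs; the last pair may have place ""."""
    if not 0 <= number < 1_000_000_000:
        raise ValueError("sketch supports 0 <= number < 10**9 only")
    parts = []
    remainder = number
    for value, name in PLACES:
        count, remainder = divmod(remainder, value)
        if count:
            parts.append((count, name))
    if remainder or not parts:
        parts.append((remainder, ""))  # trailing up to 100 part (or a bare 0)
    return parts

def read_by_position(number: int, upto_100: dict, place_names: dict) -> str:
    """Join up to 100 pronunciations with the place pronunciations."""
    words = []
    for count, place in split_by_place(number):
        words.append(upto_100[count])
        if place:
            words.append(place_names[place])
    return " ".join(words)
```

With upto_100 = {1: "ek", 22: "baais"} and place_names = {"hundred": "sou"}, read_by_position(122, upto_100, place_names) gives "ek sou baais", and split_by_place(154954) gives the lakh, thousand, hundred and remainder groups of the price example used earlier.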
The numbers 1-9 and all 10th position pronunciations need to be maintained for performing single digit reading. The pronunciations of the numerals 1-9 and of the 10th positions are presented in Table 4 and Table 5 respectively for the considered languages. A similarity may also be noticed among the pronunciations of the numerals in the three considered languages.

Numeral     Odia        Hindi       Bengali
…           …           Tish        Tirish
40          Chaalish    Chaallish   chaallish
50          pachaash    pachaash    ponchaash
60          saathiae    saatth      Shaat
70          saathiae    sattar      shottor
80          asi         ashi        ashi
90          nabe        nabe        nobboi
100         sahe        sau         Sho
1000        hajaare     hajar       hajaar
100000      lakhya      laakh       laksh
10000000    koti        karod       koti

As the probability of pronunciation repetition at word level is relatively very low in Indian languages, we derive the pronunciation similarities at phoneme level for the up to 100 pronunciations. For example, when the numeral 2 is present at the unit or 10th place, it has one type of pronunciation at the beginning or end. The pronunciation similarities in the three considered languages for 2 at the unit and 10th place are presented in Table 6 and Table 7 respectively. Considering such similarities, a set of similarity rules is prepared for the pronunciations of the numerals 11-99. The pronunciations of numerals with starting and ending similarities are presented in Table 8, Table 9 and Table 10 for Odia, Hindi, and Bengali respectively. For each language and similarity, three states are maintained at phoneme level, as shown in Figure 3, and on a proper match the respective pronounceable units are extracted from the speech database to produce the output speech. To derive all the pronunciations for the numerals, we have prepared different groups based on the similarities discussed above. We classify the pronounceable units into three state groups: Begin state (B), Middle state (M) and End state (E).
Depending on the position of the digit, i.e. unit or 10th, the states are determined and the pronunciation is derived. For example, to obtain the pronunciation of a number N of length L, written as {n_1, n_2, …, n_L}, three states {B, M, E} represent the pronunciations for the unit and 10th positions, n_L and n_{L-1} respectively, and the digits n_1 to n_{L-2} may be handled using the common pronunciation rules by concatenating the 100th, 1000th, etc. position's pronunciation with the up to 100 pronunciation. For example, in producing the pronunciation of the numeral 11 as "ek-ga-ra" in Odia, the units involved are "\ek" from the B set, "\ga" from the M set and "\ra" from the E set, as shown in Figure 3, i.e. B(L)-M(L-1)-E(L-1). However, this repetitive pattern does not hold for numbers having a 9 or 0 at the unit place. To overcome this, we separate the numbers of this category from the groups and produce their pronunciation by maintaining special cases such as {9: "na", 19: "une-is", 29: "ana-tiris", etc.} and {20: "ko-die", 30: "tiris", etc.}. Also, for some units the pronunciation rules do not include the middle state; for example, for the numeral two at the unit place, the mapping may be "\ba" from the B state followed by "\ra" from the E state, forming the pronunciation "ba-ra" for the numeral 12. The up to 100 pronunciation rules may fail for certain numerals; e.g., the numeral 14 is pronounced as "chau-da", which does not follow the pronunciation similarities. An obvious (brute force) workaround is to keep a small dictionary of such dissimilar units and check whether a given number matches any of them at the beginning of the text analysis phase. If so, it is broken up into the corresponding pronounceable units, which are passed to the next phase separately.
This works satisfactorily, and we have implemented it for a few numerals (14-"chau-da", 16-"so-ha-la", 35-"pain-tir-is", 53-"te-pan", 56-"chha-pan", etc.).
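The lookup order described above (exception dictionary first, then the B, M and E states) might be sketched as below. The state tables are hypothetical and hold only the Odia units quoted in the text; how the optional middle state is keyed is an assumption of this sketch, since the full similarity tables are not reproduced here.

```python
# Special cases and dissimilar numerals quoted in the text (Odia).
IRREGULAR = {
    9: ["na"], 19: ["une", "is"], 29: ["ana", "tiris"],
    20: ["ko", "die"], 30: ["tiris"],
    14: ["chau", "da"], 16: ["so", "ha", "la"],
    35: ["pain", "tir", "is"], 53: ["te", "pan"], 56: ["chha", "pan"],
}

# Hypothetical, deliberately tiny state tables (only units named in the text).
BEGIN = {1: "ek", 2: "ba"}   # B state, keyed by the unit digit
MIDDLE = {(1, 1): "ga"}      # M state, keyed by (tens digit, unit digit); optional
END = {1: "ra"}              # E state, keyed by the tens digit

def units_for(number: int) -> list[str]:
    """Return the pronounceable units for a numeral below 100 (sketch)."""
    if number in IRREGULAR:                # dictionary check comes first
        return list(IRREGULAR[number])
    tens, unit = divmod(number, 10)
    if unit not in BEGIN or tens not in END:
        raise KeyError(f"no rule entry for {number} in this sketch")
    units = [BEGIN[unit]]
    middle = MIDDLE.get((tens, unit))      # some numerals skip the M state
    if middle is not None:
        units.append(middle)
    units.append(END[tens])
    return units
```

Here units_for(11) returns the units "ek", "ga", "ra", units_for(12) returns "ba", "ra" without a middle unit, and units_for(14) is served from the exception dictionary.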

Speech Database Mapping And Waveform Concatenation
As the model uses WCT to produce the desired output speech, the respective base sound units in the speech database must be obtained for performing waveform concatenation. The speech database mapping phase identifies the respective speech database units for rule-based waveform concatenation. WCT is then used to produce the desired output speech. Figure 4 shows the portion concatenation process for the numeral "1", pronounced as "ae-ka" in Odia, from the two speech database units "\ae" and "\ka".

ILLUSTRATION
The context identification process for an Odia numeral is presented in Figure 5, and the speech unit identification/mapping step involved in producing the numeral pronunciation from the 35 base speech units is presented in Figure 6. In producing the output speech for the numerals, the up to 100 pronunciation rule discussed above is used to find the equivalent character units involved. The same portion concatenation method is used to produce the final output speech.

RESULT ANALYSIS
The proposed numeral reading technique and WCT are implemented in C/C++ and tested on different types of numerals in different contexts in the considered Indian languages.
To analyze the quality of the produced speech, the Mean Opinion Score (MOS) test [30] is considered, along with the storage requirement and execution time relative to the existing syllable-based text to speech technique [19]. The results obtained are discussed below.

Storage Requirement
While the syllable-based technique requires around 800 syllable speech units occupying around 1-2 MB of memory in compressed format, WCT, which produces the speech segments from the 35 basic speech units, requires only around 235 KB without further compression. No other units need to be added to the database for producing the numeral pronunciation in different contexts.

Execution Time
To analyze the performance of the proposed technique in terms of execution time compared to the syllable-based technique, different text files are prepared containing numerals in different contexts of use. By varying the number of numerals in each file from 10 to 100, the execution time (in ms) is measured for both techniques. Figure 10, Figure 11 and Figure 12 show the average execution time for both techniques. The results of all the experiments show an exponential increase in execution time for the syllable-based technique, due to the increasing number of decompressions of the .gsm files, while the proposed approach shows a relatively very low growth rate in all the scenarios tested.

Subjective Measure For Speech Quality
For performing the MOS tests, a set of random numerals N1, N2, …, N8 is selected, representing different categories of pronunciation for the specified rules. The output speech is generated by the proposed technique as well as by the syllable-based numeral reading technique. A group of 15 native speakers is selected for each language to perform the listening test; they are asked to rate the ease of understanding of the output speech produced by the two techniques on a 5-point scale (1-very low, 2-low, 3-average, 4-high, 5-very high). All the tests were performed with a headphone set. Figure 7, Figure 8 and Figure 9 show the average MOS test results over all listeners for the different numerals in the three considered languages respectively. The results of all the experiments show that the proposed technique produces results comparable to the existing technique even with a very small database.

CONCLUSIONS
In this paper, a context-based numeral reading technique is presented for Indian language text to speech systems. The proposed pronunciation rule-based model is incorporated with WCT to produce the desired output speech. To evaluate the performance of the proposed technique, a set of experiments was performed to show its effectiveness compared to the existing syllable-based technique. The subjective measure analysis shows the effectiveness of the proposed technique in producing intelligible speech compared to the existing technique, even with a very small speech database of only 35 basic units. The average execution time required by the proposed technique is also much lower than that of the existing technique. However, the model provides pronunciation rules for three Indian languages only, while the same level of similarity may be observed in other Indian languages. Therefore, the model may be further enhanced to work for other Indian languages. Also, smoothing techniques may be applied at the concatenation points to further enhance the quality of the speech produced and make it more natural sounding.