Improving keyword extraction in multilingual texts

ABSTRACT


INTRODUCTION
Designing retrieval systems for large databases is one of the research areas in which information technology is applied to the information business. As text data expand in different languages and on the web, there is an increasing demand for data retrieval systems that can cross interlingual boundaries [1][2][3][4][5][6]. Therefore, with the growing volume of electronic data in various languages, data retrieval that is independent of the document language has gained importance. Extracting effective keywords manually is a time-consuming task. Recently, automatic keyword extraction, especially in different languages, has become an interesting topic in text mining and data retrieval [7][8][9].
Text mining and information retrieval, and especially their implementation on databases, are of particular importance. The first step in these fields is to identify and extract keywords from the texts. One of the main challenges in keyword extraction is the great diversity of languages in which contextual information is written, together with the dependence of the available keyword extraction methods on the type of language and its verbal structure. Multilingual keyword extraction is therefore the current research problem, and the objective of this research is to design an unsupervised, language-independent extraction algorithm. The algorithm focuses on the property that keywords are repeated within each text and reinforced in other texts, and it utilizes the TF-IDF algorithm.
The rest of the paper is organized as follows: Section 2 reviews the state-of-the-art keyword extraction methods. The problem of keyword extraction is described in Section 3. The proposed language-independent keyword extraction algorithm and its experimental results are discussed in Section 4 and Section 5. Finally, a conclusion and recommendations are given in Section 6.
ISSN: 2088-8708. Int J Elec & Comp Eng, Vol. 10, No. 6, December 2020: 5909-5916

LITERATURE REVIEW
Several methods have been proposed so far for the identification and extraction of keywords, all of which can be classified into two groups: supervised and unsupervised methods [10][11][12]. In the following, we briefly discuss the proposed methods in order to identify the likely research challenges. The first group is the supervised methods. In this group, a model is learned from a training data set, and applying this model to a new document divides its phrases into two classes: key phrases and non-key phrases.
The supervised approach treats keyword extraction as a classification problem, in which a model must first be trained, for example with a genetic algorithm [13,14]. In the Bayes-based keyphrase extraction algorithm (KEA) proposed by [15], TF-IDF and the relative distance of the keyphrase from the beginning of the text are the two input features [16]. They also used a binary classification algorithm whose input features include some references to the text. The decision tree of [17], the conditional random field of [18], and a variant of KEA in [19] are other examples of supervised keyword extraction. The performance of these methods is highly dependent on the training data, and a lack of high-quality data can cause an efficiency drop in the keyphrase extraction system. Moreover, the trained model is specific to one domain and works only within that domain of usage.
Another approach to extracting keywords is through unsupervised methods. In these methods, keyword extraction is treated as a ranking problem [20], and the most important of them is TF-IDF. In this method, the number of repetitions of a word within a text is weighed against the number of its repetitions in other texts [21]. Graph-based methods are also unsupervised [22]; the works of [22][23][24] are examples of graph-based keyword extraction. Unsupervised methods need no training data, and the most important phrases of a text can be extracted by using ranking strategies. Unlike the supervised methods, the unsupervised methods are applicable to texts of any domain, independently of the domain of usage. A qualitative analysis and comparison of the proposed methods reveals several advantages and disadvantages, which are noted as follows.
The first advantage of the unsupervised methods is that they can be applied to texts of any type and domain. Their other advantages are that they do not depend on training data and therefore suffer no efficiency drop when such data are of poor quality, that they take less time for keyword extraction than the supervised methods, that they function well for high-volume data, and that they achieve high accuracy. In contrast to these advantages, low compatibility is the most tangible shortcoming of these methods. The supervised methods also have advantages and disadvantages. Their main advantage is that they can exploit training data in which the data are already categorized. However, a significant shortcoming is that they depend on the training data: providing such data is a time-consuming and laborious task, a lack of high-quality data can lead to an efficiency drop of the keyword extraction system, and the constructed model serves one domain only and acts based on that domain of usage. Moreover, evaluations based on frequency are not applicable to high-volume data. These problems do not arise in the unsupervised methods [1,3]. Hence, we employ an unsupervised method for the proposed algorithm.
Despite its simplicity, the TF-IDF algorithm is one of the effective methods for keyword extraction [16,25]. Its practical simplicity and efficiency have attracted considerable attention. In the present study, an algorithm is proposed for keyword extraction that improves TF-IDF. The method is based on TF-IDF but uses the information of each text in several languages to enhance the extraction. To achieve this objective, we concentrate on the repetition of words in the text and delete conjunctions, prepositions, and verbs. Further, we use the multilingual information available for a given text simultaneously to improve the result. This process is explained in detail in the following.

PROBLEM DESCRIPTION OF KEYWORDS EXTRACTION
Data retrieval is used extensively in people's everyday life. Enhancing efficiency and improving performance is of great importance for the designers of data retrieval systems. One way to increase the productivity of data retrieval systems is through the use of statistical schemes. In these schemes, a frequency is computed for every candidate word, and the words with the highest frequency are selected as keywords. The aim of the present study is to propose an algorithm that has the required features: it is unsupervised, language independent, simple, and fast enough to process a considerable amount of data. By using the proposed algorithm along with TF-IDF, which is a statistical, simple, language-independent, and unsupervised algorithm, by relying on a sequence of calls in Unicode format, and by designing an online database, keywords can be extracted from large databases independently of language.
An assessment of the applications of data retrieval and text mining shows that the keywords within a text play a significant role and facilitate the processes in this field. For example, by finding the important words in a news item and detecting the sentences that contain more of them, we can extract those sentences into an abstract and comprehend the text better. Since important words often appear in headings and important sections, knowing the structure of a text and extracting keywords from these parts gives access to the keywords in minimal time. Feeds (RSS) are used for reading news; they make a news extract available in a structured way in XML format. The template for reading and saving the news is Unicode. To extract keywords from news texts, we need websites with proper and authentic feed addresses. Hence, we select feeds that provide appropriate information, and we select them for every language. After the information is called from the feeds, it is saved in a database.
Some words occur with high frequency in all texts but have no contextual value, such as pronouns, adverbs, prepositions, conjunctions, and some frequent verbs. These elements are called public words. Omitting the public words in statistical text mining leads to fewer calculations and higher efficiency. Each remaining word takes a weight based on its frequency in the document, and this weight shows how important the word is for that document. Weighting of this kind is used frequently in data retrieval. The weight of a word increases with the number of its repetitions in the text, but it is controlled by the number of words in the text. This method is an unsupervised one, which is applied to plain text.
In contrast to the supervised methods, this method does not need a training data set; providing appropriate training data is a time-consuming and difficult task, and if the data lack the desired quality, they reduce the efficiency of a supervised keyword extraction system. Figure 1 presents the overall structure of the proposed algorithm in seven steps, the details of which are discussed as follows.
Step 1 (selecting feeds and retrieval): to gain access to documents in different languages, appropriate feeds are selected. The data of each document, such as its title and body, are retrieved in this step. Since the algorithm is language independent, the information is read in Unicode format.
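Step 1 can be illustrated with a minimal sketch, assuming the feed is a standard RSS 2.0 document. The sample feed, the function name `read_feed_items`, and the choice of `<title>` and `<description>` as title and body are assumptions made for illustration; they are not taken from the paper, and a real system would download the XML from a feed address first.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 payload used only for illustration.
SAMPLE_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example news feed</title>
    <item>
      <title>Demonstration held in the capital</title>
      <description>People gathered for a demonstration on Friday.</description>
    </item>
    <item>
      <title>Manifestation dans la capitale</title>
      <description>Les gens se sont rassemblés vendredi.</description>
    </item>
  </channel>
</rss>"""


def read_feed_items(feed_xml: str) -> list[dict[str, str]]:
    """Return the title and body text of every <item> in an RSS document."""
    # Parsing UTF-8 bytes keeps the XML encoding declaration consistent
    # and preserves non-ASCII (Unicode) characters.
    root = ET.fromstring(feed_xml.encode("utf-8"))
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "body": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in read_feed_items(SAMPLE_FEED):
        print(entry["title"], "|", entry["body"])
```

Because the text is decoded to Python `str` objects, the same function handles feeds in any language without language-specific code.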

THE PROPOSED ALGORITHM
Step 2 (saving document information in the large database): the information that has been read is stored in the database, separately. Data are stored in Unicode format, which covers most languages.
Step 3 (word extraction): all words are extracted from the text, and the public words are omitted in this step. Every language has its own list of such repetitive words, which are deleted from the extracted words.
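The word extraction of Step 3 might look like the following sketch. The public-word lists are hypothetical and deliberately tiny, and the tokenizer (a Unicode-aware `\w+` pattern with lower-casing) is an assumption; the paper does not specify how words are split.

```python
import re

# Hypothetical public-word (stop-word) lists; real lists would be much longer
# and there would be one per supported language.
PUBLIC_WORDS = {
    "en": {"the", "a", "in", "on", "of", "and", "is", "was", "for"},
    "fr": {"le", "la", "les", "de", "et", "dans", "est", "un", "une"},
}

# For str patterns, \w matches Unicode letters, digits, and the underscore.
WORD_PATTERN = re.compile(r"\w+")


def extract_words(text: str, language: str) -> list[str]:
    """Split text into lower-cased words and drop the language's public words."""
    public = PUBLIC_WORDS.get(language, set())
    words = WORD_PATTERN.findall(text.lower())
    return [word for word in words if word not in public]


if __name__ == "__main__":
    print(extract_words("The people of London joined the demonstration", "en"))
    print(extract_words("La manifestation est dans la capitale", "fr"))
```

For the English sentence the function returns `people`, `london`, `joined`, and `demonstration`; only the lookup table changes from one language to another.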
Step 4 (TF-IDF calculations): TF-IDF is calculated in this step for every text and language, and the TF-IDF values of each text in the different languages are finally used for improving the keywords. In this method, each word has a frequency-based weight in the document, which shows how important the word is for that particular document. This weighting is used frequently in data retrieval. The weight of a word increases with its repetition in a given text, but it is controlled by the number of words in the text, because in a lengthy text some words are naturally repeated even though they carry no significance for the meaning. Term frequency (TF) is a criterion for how common and repetitive a word is in a text, and it is calculated as follows:
$$\mathrm{tf}(t, d) = 0.5 + 0.5 \times \frac{f(t, d)}{\max\{f(w, d) : w \in d\}} \qquad (1)$$
where f(t, d) in the numerator is the number of occurrences of the word t in the selected text d, and w in the denominator is the most frequent word in the selected text.
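As a hedged illustration of the term frequency described above, the sketch below assumes the formula is the usual augmented form, 0.5 plus 0.5 times the count of a word divided by the count of the most frequent word in the same text. The function name `augmented_tf` is hypothetical.

```python
from collections import Counter


def augmented_tf(words: list[str]) -> dict[str, float]:
    """Return 0.5 + 0.5 * f(t, d) / max_w f(w, d) for every word t of one text d."""
    counts = Counter(words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: 0.5 + 0.5 * count / max_count for term, count in counts.items()}


if __name__ == "__main__":
    print(augmented_tf(["quds", "quds", "people", "iran"]))
```

In this example the most frequent word, `quds`, gets the weight 1.0 and the two words that occur once get 0.75, so the weights always fall between 0.5 and 1 regardless of the text length.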
IDF (inverse document frequency) is a criterion for how common a word is across all texts. It is obtained by dividing the total number of texts by the number of texts that include the word and taking the logarithm of the result. For example, suppose that there are 1000 texts in the whole database. If a certain word (like "is") occurs in all of them, the ratio is 1000 divided by 1000, whose logarithm is 0; that is, the word is among the common words and takes the coefficient 0. However, if the word occurs in 500 texts, the ratio is 2, whose base-2 logarithm is 1, so the word takes the coefficient 1. The more texts a word is repeated in, the lower its IDF weight. In case a word occurs in no text and the denominator becomes 0, 1 is added to the denominator, which gives the second formula:
$$\mathrm{idf}(t, D) = \log_2 \frac{|D|}{1 + |\{d \in D : t \in d\}|} \qquad (2)$$
where |D| in the numerator is the number of existing texts and the denominator contains the number of texts bearing the word t. The TF-IDF is calculated through formula (3) as follows:
$$\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D) \qquad (3)$$
Step 5 (saving calculations in the database): the calculations performed by the TF-IDF algorithm are saved in the database.
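The IDF and TF-IDF calculations described above can be sketched as follows. Two assumptions are made here: the logarithm is taken to base 2, which is what the 500-of-1000 example implies, and 1 is always added to the denominator. The function names and the sample documents are hypothetical.

```python
import math


def idf(term: str, documents: list[set[str]]) -> float:
    """Return log2(N / (1 + n_t)), where n_t is the number of documents containing term."""
    containing = sum(1 for doc in documents if term in doc)
    return math.log2(len(documents) / (1 + containing))


def tf_idf(tf_value: float, idf_value: float) -> float:
    """Combine a term-frequency weight and an IDF weight by multiplication."""
    return tf_value * idf_value


if __name__ == "__main__":
    # Each document is represented by the set of distinct words it contains.
    docs = [{"quds", "people"}, {"people"}, {"people", "iran"}, {"people"}]
    for term in ("quds", "people", "absent"):
        print(term, idf(term, docs))
```

With these four documents, `quds` (one document) gets an IDF of 1.0 and a word that occurs nowhere gets 2.0. Note that, with the added 1, a word present in every document gets a slightly negative value rather than exactly 0.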
Step 6 (improving the extraction with the proposed TF-IDF): in the conventional TF-IDF, the words with the highest TF-IDF values in a text of a certain language are considered the keywords of that text in that language. In the proposed method, however, words are called keywords if their average TF-IDF is high over the versions of that text in the same language and in the other languages. In other words, instead of the TF-IDF of a text in one language, the average of its TF-IDF over the available languages is used. This simple but useful method can improve the extraction of keywords significantly. In this paper, the average and maximum TF-IDF of a text over the different languages are both tested, and both outperform the conventional method; the method that uses the average TF-IDF has the highest accuracy.
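The averaging idea of Step 6 can be sketched as below. The scores are invented for illustration, and the sketch assumes that the words of the different language versions have already been mapped to shared, language-independent labels; how that alignment is done is not described in the text, so it is treated here as given. The threshold of 20 follows the value mentioned for Table 1, and all function names are hypothetical.

```python
from statistics import mean

# Hypothetical TF-IDF scores of one news text in three languages.
# Keys are language-independent labels; the cross-language word alignment
# is an assumption of this sketch.
SCORES_BY_LANGUAGE = {
    "fa": {"quds": 30.0, "america": 26.0, "london": 4.0},
    "en": {"quds": 24.0, "america": 22.0, "london": 25.0},
    "ar": {"quds": 27.0, "america": 18.0, "london": 1.0},
}


def mean_tf_idf(scores_by_language: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each word's TF-IDF over all languages (missing scores count as 0)."""
    labels = set().union(*scores_by_language.values())
    return {
        label: mean(scores.get(label, 0.0) for scores in scores_by_language.values())
        for label in labels
    }


def select_keywords(scores_by_language: dict[str, dict[str, float]],
                    threshold: float) -> list[str]:
    """Return, in alphabetical order, the words whose mean TF-IDF exceeds threshold."""
    averaged = mean_tf_idf(scores_by_language)
    return sorted(label for label, value in averaged.items() if value > threshold)


if __name__ == "__main__":
    print(select_keywords(SCORES_BY_LANGUAGE, threshold=20.0))
```

In this invented example, `london` would pass the threshold if only the English scores were used (25 > 20), but its mean over the three languages is 10, so it is dropped; replacing `mean` with `max` or `statistics.median` gives the maximum and median variants.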
Step 7 (depicting results): this step shows the keywords that were extracted by the improved TF-IDF algorithm.
Step 8 (evaluating the accuracy rate): in this step, the keyword extraction accuracy of the algorithm is calculated through the following formula:
$$\text{Accuracy} = \frac{\text{number of correctly extracted keywords}}{\text{total number of extracted keywords}} \times 100 \qquad (4)$$
where the correctly extracted keywords are the words that are common to the actual keywords and the keywords extracted by the algorithm. The denominator is the total number of words extracted by the algorithm as keywords.
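The accuracy rate of formula (4) can be written as a short function. This is a sketch only: the function name `accuracy_rate` is hypothetical, and the set of actual keywords in the example is an illustrative assumption built from the seven words reported later for the mean method, with "america" treated as the wrong one.

```python
def accuracy_rate(extracted: set[str], actual: set[str]) -> float:
    """Percentage of extracted keywords that are also actual keywords."""
    if not extracted:
        return 0.0
    correct = len(extracted & actual)
    return correct / len(extracted) * 100


if __name__ == "__main__":
    extracted = {"quds", "zionist", "america", "demonstration",
                 "people", "palestine", "iran"}
    # Illustrative reference set: every extracted word except "america".
    actual = extracted - {"america"}
    print(f"{accuracy_rate(extracted, actual):.1f}%")
```

Because the denominator is the number of extracted keywords, this measure corresponds to what is usually called precision.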
The pseudo-code of the proposed keyword extraction algorithm is presented in Figure 2. The algorithm is unsupervised and can be run on plain text. This means that, unlike supervised keyword extraction algorithms, it needs no training data set. As is well known, providing appropriate training data is time-consuming and difficult, and if the data are not of good quality, the efficiency of supervised keyword extraction algorithms declines.
Figure 2. The pseudo-code of the proposed algorithm

EXPERIMENTAL RESULTS
The proposed algorithm was programmed in SQL Server 2012 and Visual Studio 2013, and the simulations were performed on an Intel Core i5 (64-bit) CPU at 2.50 GHz with 21 GB of RAM. The database used for evaluating the efficiency and performance of the proposed keyword extraction algorithm is an online data set containing 200 news items collected from the BBC website in various languages. Each news item is available in eight languages. This data set was used because it provides up-to-date information that is processed at the same time. The proposed method is assessed by counting the number of matches between the keywords it extracts and the given keywords.

The results of the proposed algorithm
An algorithm is designed in this study that is language independent and has a simple structure. In contrast to language-dependent algorithms (like [26], which uses Persian roots for keyword extraction), this algorithm works simply for large databases in every language. In the proposed TF-IDF algorithm, the words that have a high frequency in a text across all languages (a high mean TF-IDF over all languages) are selected as keywords, and considering the text in various languages improves the accuracy of the algorithm. It is noteworthy that non-keywords, including verbs and prepositions, are repeated considerably in a text, so we set all non-keywords aside at the very beginning. The proposed algorithm is applicable to all multilingual websites; here the results are shown only for the BBC News website. The database comprises 200 news items collected from the BBC website in eight languages (a total of 1600 news items). As can be seen in Table 1, words with a relatively high TF-IDF (here, a TF-IDF of more than 20) were considered, while in the conventional TF-IDF algorithm the words with the highest TF-IDF values are counted as keywords independently in every language. As can be seen in Table 2, in Persian the word "America" is mistakenly detected as a keyword (shown in bold in Table 2), while in English the three words "America", "England", and "London" (shown in bold in Table 2) are mistakenly detected as keywords. In the other languages, two or three words are also mistakenly detected as keywords. Table 3 illustrates the results of the proposed algorithm for the selected text. As can be seen in the table, the mean TF-IDF over the eight languages (the proposed algorithm) is calculated for each word in Table 1, and seven keywords are selected.
The keywords selected by this method apply to all eight languages: for every language version of this text, the keywords of the mean method, shown in Table 3, are Quds, Zionist, America, demonstration, people, Palestine, and Iran, of which America is mistakenly detected as a keyword for all languages. However, as noted for Table 2, in the conventional TF-IDF method the number of wrongly detected keywords differs between languages and is more than one for most of them. Comparing the accuracy of the mean TF-IDF algorithm (the proposed one) with that of the conventional algorithm (shown in Table 2), the conventional algorithm correctly detected 6 of 7 Persian words, 4 of 7 English words, and 3 of 7 Arabic words, as well as further words in the other languages. In total, over 8 languages and 56 (7 × 8) correct keywords, 39 were detected correctly, so the accuracy of the conventional algorithm is 39/56 ≈ 0.696, while the mean TF-IDF method detected 6 of 7 words correctly for all languages, so its accuracy is 6/7 ≈ 0.857. The same is the case for the mean and maximum methods.

The comparison of the obtained results with the other related algorithms
To evaluate its efficiency and performance, the accuracy rate of the proposed algorithm is compared with that of other methods. The algorithm was tested with 200 texts in eight languages, which are shown in Table 1, and 1200 correct keywords were obtained. The accuracy rate of the conventional TF-IDF algorithm, with 1014 correct words among 1672 extracted keywords, is 60.6%, while that of the proposed algorithm, namely the mean TF-IDF, with 1164 correct words among 1275, is 91.3%. With the median method, 1092 correct words among 1456 give a rate of 75%. With the maximum method, 1021 correct words among 1531 give an accuracy rate of 66.6%. The accuracy rate of the graph-based algorithm [27] on these data is 80%. Among the obtained rates, the mean method, with an accuracy rate of 91.3%, is the best. Table 4 shows the summary of the results on the BBC data. This suggests that the proposed algorithm not only extracts keywords independently of language but also achieves considerably better results. Table 5 compares the algorithm with other related algorithms.
Table 5. Comparing the algorithm with other related algorithms
Algorithm | Accuracy
The Proposed Algorithm | 91.3%
Graph [27] | 80%
Kp [28] | 47.7%
MSF [29] | 60%
GATE [30] | 64.4%
Habibi [1] | 75%
Single-Document [31] | 83.2%
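The reported accuracy rates can be recomputed from the counts given in the text with a few lines of code. This is only an arithmetic check: the counts are copied from the paragraph above, while the dictionary layout and the function name `rate` are assumptions of the sketch.

```python
# (correct keywords, extracted keywords) as reported in the text.
RESULTS = {
    "conventional TF-IDF": (1014, 1672),
    "mean TF-IDF (proposed)": (1164, 1275),
    "median TF-IDF": (1092, 1456),
    "maximum TF-IDF": (1021, 1531),
}


def rate(correct: int, extracted: int) -> float:
    """Accuracy rate in percent: correct keywords over extracted keywords."""
    return correct / extracted * 100


if __name__ == "__main__":
    for method, (correct, extracted) in RESULTS.items():
        print(f"{method}: {rate(correct, extracted):.1f}%")
```

Rounded to one decimal place, the counts give 60.6%, 91.3%, 75.0%, and 66.7%. The last value differs from the reported 66.6% only because 1021/1531 ≈ 0.66688 is truncated rather than rounded in the text.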

CONCLUSION
Data retrieval is widely applied in everyday life. Increasing the efficiency and performance of information retrieval systems is very important for their designers. An investigation of the applications of data retrieval and text mining shows that the keywords of a text are important and facilitate the orientation of these processes. For example, by finding the keywords in a news item, or the sentences that contain more keywords, we can summarize or comprehend the text more easily. To achieve this aim, an unsupervised keyword extraction algorithm is proposed that improves the TF-IDF algorithm for multilingual texts. In the proposed algorithm, the average TF-IDF of each candidate word is calculated over the same and the other languages, and the words with the higher average TF-IDF are chosen as the extracted keywords. A database of 200 news items collected from the BBC website in various languages was used to evaluate the efficiency of the proposed algorithm. The experimental results show that the selected keywords are closer to the keywords given by the website, which confirms the reliability of the algorithm. The overall accuracy rate of the algorithm is 91.3%, which is higher than that of the state-of-the-art keyword extraction algorithms. As future work, we would like to introduce three strategies to improve the proposed algorithm in application, complexity, and time: finding complex keywords could be added to the algorithm, real-time and online behaviour could be achieved by focusing on parallel processing, and normalizing the feeds' addresses could be considered to facilitate access.