An improved Arabic text classification method using word embedding

ABSTRACT


INTRODUCTION
Text classification is one of the most common tasks in text mining; it associates a given text with one or more categories from a predefined set [1]. Each document is represented by a huge number of features that define the dimensionality of the dataset, making the process of training classification models difficult and slow [2]. This problem is addressed by feature selection, which is the process of improving model performance by eliminating irrelevant and redundant features. Irrelevant features contain no useful information on the topic of classification, whereas redundant features contain information that already exists in more useful features [3]. Feature selection has been applied in areas that include sentiment analysis [4]-[6], text classification [7], [8], image retrieval [9], and more. It is possible to select the most informative features from the original datasets using a variety of feature selection approaches. In fact, there are three feature selection strategies: filter, wrapper, and hybrid [10]. Each of these strategies decreases the number of features used while improving the precision of the results. Although much research on feature selection (FS) for text classification has been conducted for languages such as English, German, Spanish, and Turkish [11], the number of publications dealing with Arabic remains limited due to its morphology and grammatical rules [12].
Int J Elec & Comp Eng, ISSN: 2088-8708, Vol. 14, No. 1, February 2024: 721-731
Many FS methods have been proposed for text classification research. They fall into three families: i) filter-based (also known as traditional) FS techniques, such as mutual information (MI) [13], information gain [14], and Chi-square [7], [15]; ii) wrapper-based FS, which uses the ranking of the available features in terms of their relative importance to select a set of features to be used in the model. It can be time-consuming and computationally expensive, but it can produce powerful models that efficiently use the available data. This family includes recursive feature elimination, sequential backward selection, and genetic algorithms [16], [17]; and iii) hybrid FS, which combines both filter and wrapper methods: the idea is to use a filter method to pre-select the most promising features before applying a wrapper method to further refine the selection [18], [19]. In recent years, many methods have been developed to extract the most relevant and meaningful features. We review some of the key methods below.
Tubishat et al. [4] proposed an improved feature selection method for Arabic sentiment analysis, namely the improved whale optimization algorithm (IWOA). They combined information gain (IG) with the whale optimization algorithm (WOA) using the support vector machine (SVM) classifier. Four datasets were used in the experiment: opinion corpus for Arabic (OCA), Arabic Twitter, political, and software. The findings revealed that the proposed method is more successful than five machine learning algorithms and two deep learning techniques, namely differential evolution (DE), grasshopper optimization algorithm (GOA), whale optimization algorithm (WOA), particle swarm optimization (PSO), genetic algorithm (GA), long short-term memory (LSTM), and convolutional neural network (CNN). The best accuracy obtained for this method in the comparison with machine learning algorithms is 95.93%, while the best accuracy obtained in the comparison with deep learning methods is 99.39%.
Marie-Sainte and Alalyani [20] suggested a novel approach to improve the Arabic text classification (ATC) procedure called firefly algorithm based feature selection (FAFS), which is inspired by the social behavior of fireflies. Three evaluation metrics, namely precision, recall and F-measure, are used in conjunction with the SVM classifier. The experimental tests showed that the FAFS method achieved a precision of 99.40%. In a similar vein, Singh et al. [21] suggested another method to improve text classification accuracy that combines term frequency-inverse document frequency (TF-IDF) with a GloVe word embedding to identify words with similar semantic content. Among terms with similar meanings, the most representative one is chosen as the one with the highest sum of TF-IDF. The authors also presented a new metric to evaluate the performance of the classifier on the reduced features. The results revealed that the suggested approach was more effective than principal component analysis (PCA), linear discriminant analysis (LDA), latent semantic indexing (LSI) and PCA+LDA. The authors used three different corpora, namely British Broadcasting Corporation (BBC), Classic4, and 20 newsgroups. The proposed method achieved an accuracy of 96.18% on the BBC dataset, 91.12% on the Classic4 corpus, and 90.25% on the 20 newsgroups dataset.
Jin et al. [22] developed a system for computing semantic similarity between words based on Word2Vec. For this purpose, the authors combined the word vector model with HowNet and TongYiCi CiLin to compare the similarity of words. To enhance the similarity process and extend the coverage of all features, they improved the dictionaries and increased the size of the corpus used to train the Word2Vec model. Sabri et al. [23] compared three feature vectorization techniques: TF-IDF, word count and Word2Vec. They employed five different classifiers: support vector machine (SVM), k-nearest neighbors (KNN), decision tree (DT), random forest (RF) and logistic regression. The experiments were applied to two common Arabic corpora: Arabic-CNN and OSAC-utf8. The study revealed that the SVM and logistic regression models were more successful than all the other machine learning methods. The testing phase indicated that the vectorization method had a significant impact on classification accuracy.
The effect of three algorithms, namely naive Bayes (NB), k-nearest neighbors (KNN), and support vector machine (SVM), on spam email classification was studied comparatively in [24]. The authors improved the spam classification quality by reducing the number of features for classifiers using four optimization feature selection methods: genetic algorithm, harmony search, PSO, and local search. For experimentation, they used the SPAM E-mail dataset with 4,601 emails, of which 1,813 are spam. The performance of the models was evaluated using both accuracy and F-measure scores. The empirical results showed that SVM was more successful than other methods for spam classification when feature selection was integrated, whereas the NB classifier reported poorer results.
This paper proposes an improved FS method named removal of Arabic redundant features (RARF) to build a feature subset from the original Arabic dataset as well as to improve model accuracy. Experiments on two benchmark Arabic datasets, Khaleej-2004 [25] and Watan-2004 [26], show that the proposed method gives better classification results than benchmark FS techniques such as PCA, LDA, Chi-square, and MI. The remainder of the paper is structured as follows: section 2 explains the details of the proposed RARF method. Section 3 is devoted to the experimental results and analysis. The last section includes a conclusion and recommendations for future work.

RESEARCH METHOD
In this paper, we propose an improved FS technique that utilizes the word embedding method Word2Vec for ATC. Our contribution consists of identifying and grouping similar Arabic words based on the numerical vectors generated by our Word2Vec model. Arabic words in the same group are considered redundant features, and each word group is replaced by a representative term obtained by TF-IDF weighted Word2Vec embedding. This helps to eliminate Arabic redundant features and can increase the classification accuracy. When using dimension reduction techniques such as PCA, LDA, Chi-square, and MI, it is necessary to specify the number of features of the new vector space. In the case of our method, however, the number of dimensions depends on the groups obtained when applying the RARF algorithm.
The process of the proposed method involves several steps, as illustrated in Figure 1. This research explores three main stages: FS using our Word2Vec model created from five Arabic datasets, Arabic text classification, and accurate prediction of the category of a new Arabic document. The following sub-sections discuss the design steps of the suggested method.

Data pre-processing
Prior to classification, it is essential to pre-process the text data. In fact, the models used to classify texts will not be able to accurately identify the content of a text if it is not given in the expected format. Each Arabic document in the dataset was processed in the following steps:
− Remove URLs, non-Arabic characters and symbols.
− Remove diacritics (for example, change a diacritized "ب" to the bare "ب").
− Remove extra whitespace and punctuation marks.
− Delete stopwords like "حتى", "من", "كيف" and non-meaningful words.
− Normalize "ٱإأآ" to "ا" and "ئي" to "ى".
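The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the Unicode ranges, the URL pattern, and the three-word stopword set (only the examples quoted in the text) are assumptions, and the function name `preprocess` is hypothetical.

```python
import re

# Assumption: only the three stopwords quoted in the text, written as escapes
# to avoid ambiguity between look-alike code points.
STOPWORDS = {
    "\u062D\u062A\u0649",  # حتى
    "\u0645\u0646",        # من
    "\u0643\u064A\u0641",  # كيف
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Keep Arabic letters (U+0621-U+064A), diacritics (U+064B-U+0652, U+0670),
# alef wasla (U+0671) and whitespace; anything else becomes a space.
NON_ARABIC_RE = re.compile(r"[^\u0621-\u0652\u0670\u0671\s]")
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0670]")
ALEF_VARIANTS_RE = re.compile(r"[\u0671\u0625\u0623\u0622]")  # ٱ إ أ آ
YEH_VARIANTS_RE = re.compile(r"[\u0626\u064A]")               # ئ ي


def preprocess(text):
    """Apply the listed cleaning steps to one document and return a string."""
    text = URL_RE.sub(" ", text)          # remove URLs
    text = NON_ARABIC_RE.sub(" ", text)   # remove non-Arabic characters and symbols
    text = DIACRITICS_RE.sub("", text)    # remove diacritics
    # split() also collapses extra whitespace
    tokens = [t for t in text.split() if t not in STOPWORDS]
    text = " ".join(tokens)
    text = ALEF_VARIANTS_RE.sub("\u0627", text)  # normalize alef forms to ا
    text = YEH_VARIANTS_RE.sub("\u0649", text)   # normalize yeh forms to ى
    return text
```

Stopwords are removed before normalization, in the order given in the list, so that a stopword such as "كيف" is still matched in its original spelling.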

Documents representation by TF-IDF
In a vector space model, each document is represented as a vector of numerical values. The number of unique features determines the number of dimensions. In this work, we use TF-IDF as a bag of words (BoW) strategy to represent each document in the dataset. We set the maximum number of features to 3,000, which allows us to keep only the relevant features and to minimize the feature space. This is not enough, however: we also need to eliminate similar features. For this purpose, we use Word2Vec to extract similar words.
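As a rough illustration of this representation, the sketch below builds a TF-IDF matrix with a cap on the vocabulary size. It assumes the classic weighting tf × log(n / df) and assumes that the cap keeps the most frequent terms in the corpus; the text does not state which TF-IDF variant or selection rule was used, and the function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs, max_features=3000):
    """docs: list of token lists. Returns (vocabulary, rows of TF-IDF weights)."""
    # Assumption: keep the max_features most frequent terms of the corpus.
    counts = Counter(tok for doc in docs for tok in doc)
    vocab = [term for term, _ in counts.most_common(max_features)]
    index = {term: j for j, term in enumerate(vocab)}

    n_docs = len(docs)
    # Document frequency: number of documents containing each kept term.
    df = Counter(t for doc in docs for t in set(doc) if t in index)

    rows = []
    for doc in docs:
        tf = Counter(t for t in doc if t in index)
        row = [0.0] * len(vocab)
        for term, count in tf.items():
            row[index[term]] = (count / len(doc)) * math.log(n_docs / df[term])
        rows.append(row)
    return vocab, rows
```

Each row has one entry per kept feature, so the number of columns is at most `max_features`.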

Similarity detection using Word2Vec model
Word embedding is used to represent words or sentences in a text as vectors of numerical values [27]. These novel ways of representing text data have enabled an advancement in the accuracy of natural language processing (NLP) techniques, such as text classification. Word embedding is based on the linguistic concept of distributional semantics, which was pioneered by Harris [28]. This theory suggests that the meaning of a word is determined by its context. Therefore, words used in similar contexts tend to have similar meanings. Word2Vec is one of the most widely known word embedding algorithms. It was developed by a Google research team led by Tomas Mikolov [29]. It is a two-layer neural network-based algorithm that tries to learn the vector representations of words in a text. Word2Vec has two neural architectures, called continuous bag of words (CBOW) and skip-gram, from which the user can choose. Figure 2 shows the difference between the two models. CBOW takes the words in the context of a sentence and attempts to identify the target word, while skip-gram does the opposite and tries to predict the words that are in the context of the given word [29].
To measure the similarity between features, we trained our own Word2Vec model on five Arabic datasets of different sizes and classes. For this purpose, the CBOW architecture was used because it is generally faster to train than the skip-gram architecture and has better accuracy for frequent features [30]. This model generated 147,873 feature vectors with 300 dimensions, which are used in the Arabic feature similarity measure. After this step, we chose a similarity threshold of 0.7 (between 0 and 1), because our empirical findings indicated that the classification achieves higher accuracy when using this threshold. If the similarity measure between Arabic words exceeded this threshold, the words were grouped into the same set. Finally, we represented each set by its TF-IDF weighted Word2Vec representation.
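The thresholding step can be illustrated with a small sketch. Two points are assumptions, since the text does not spell them out: similarity is taken to be the cosine similarity between word vectors, and groups are formed greedily (each ungrouped word seeds a group and collects the remaining words whose similarity to it exceeds the threshold). The names `cosine` and `group_similar` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_similar(vectors, threshold=0.7):
    """vectors: dict mapping word -> embedding. Returns a list of word groups."""
    words = list(vectors)
    assigned = set()
    groups = []
    for seed in words:
        if seed in assigned:
            continue
        group = [seed]
        assigned.add(seed)
        # Assumption: greedy grouping around the seed word.
        for other in words:
            if other not in assigned and cosine(vectors[seed], vectors[other]) > threshold:
                group.append(other)
                assigned.add(other)
        groups.append(group)
    return groups
```

With this scheme the number of resulting groups, and therefore the number of final features, follows from the threshold rather than being fixed in advance.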
To train the Word2Vec model, a large series of experiments was performed to tune the hyperparameters (window size, vector size, and minimum count). We trained on a batch of documents from five Arabic datasets using the generate similar (Gensim) toolkit developed by Rehurek and Sojka [31]. It is a free and robust module for NLP that is used to generate word and document vectors. Table 1 shows the hyperparameters used to train our model as well as the Arabic datasets used: Khaleej-2004 [25], Watan-2004 [26], ANTCorpus [32], Arabic-CNN [33], and OSAC [33].

Dimension reduction methods
Dimension reduction in text classification is a technique used to reduce complexity by minimizing the number of features used to train a text classification model. This can be accomplished by selecting the most relevant features and removing redundancies. To evaluate the performance of our method, the following four well-known methods were used: PCA, LDA, Chi-square, and MI. PCA is a statistical method used in text classification that reduces the number of features by transforming a set of related variables into a smaller set of variables that explain most of the variance in the original set [34]. LDA uses a probabilistic graphical model to infer topics from a dataset by analyzing word frequency distributions [35]. The topics inferred by LDA can then be used as features for supervised learning algorithms in order to predict the class of a given text. The Chi-square test is used in statistics to test the independence of two events; it can select the features that are most likely to have an impact on the target variable and allow the model to be more accurate [7]. MI is a selection method that measures the dependence between an independent variable and the target variable. As such, MI is zero when two variables are independent, while a higher value reflects a greater dependence [36].
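To make the last property concrete, the following sketch estimates MI between a discrete feature and the class label from observed counts, using the standard definition (sum over value pairs of p(x, y) log(p(x, y) / (p(x) p(y)))). The function name `mutual_information` is hypothetical and the estimate is expressed in nats.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """MI (in nats) between two equally long sequences of discrete values."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of the feature
    py = Counter(ys)            # marginal counts of the class label
    pxy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi
```

For example, a feature that is present in exactly half of the documents of each class yields 0, whereas a feature that perfectly tracks a balanced binary label yields log 2.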

Removal of Arabic redundant features
In this section, we describe our proposed method for Arabic text classification. It makes use of the Word2Vec embedding technique and TF-IDF term weighting. The aim is to find features with certain semantic relationships. The steps of the method are summarized in Algorithm 1.

Example:
Table 2 shows the TF-IDF matrix of a dataset that contains $n$ documents using $m$ features. The entry $w_{ij}$ is a statistical metric that evaluates the importance of feature $f_j$ in document $d_i$ relative to the dataset. We suppose that the $k$-th group contains the features $f_1$, $f_4$, $f_6$ and $f_7$: $G_k = \{f_1, f_4, f_6, f_7\}$. Table 3 represents the similarities between the words of the $k$-th group, where $\mathrm{sim}(f_1, f_4)$ is the similarity value between the first and the fourth feature. Finally, the features of this group are replaced by a single representative feature in the matrix.
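The replacement of a group of columns by one representative column can be sketched as follows. The weighting used here is purely an illustrative assumption and not the paper's combination rule: each feature of the group is weighted by its mean similarity to the other members, the weights are normalized to sum to 1, and the representative value for a document is the weighted sum of the group's TF-IDF values. The function name `merge_group` and the layout of `sim` are hypothetical.

```python
def merge_group(rows, group, sim):
    """Replace the columns of one group by a single representative column.

    rows:  TF-IDF matrix as a list of lists (documents x features).
    group: column indices of the features in the group.
    sim:   dict mapping frozenset({j, k}) -> similarity of features j and k.
    """
    # Assumed weight: mean similarity of each feature to the other members.
    raw = []
    for j in group:
        others = [sim[frozenset((j, k))] for k in group if k != j]
        raw.append(sum(others) / len(others) if others else 1.0)
    total = sum(raw)
    weights = [r / total for r in raw]

    keep = [j for j in range(len(rows[0])) if j not in group]
    merged = []
    for row in rows:
        representative = sum(w * row[j] for w, j in zip(weights, group))
        # Untouched columns first, representative column appended last.
        merged.append([row[j] for j in keep] + [representative])
    return merged
```

Applied to every group in turn, a step of this kind shrinks the matrix from one column per word to one column per group.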

How does the proposed method differ from existing methods?
The RARF FS method groups similar words by combining TF-IDF values and similarity rates generated by the Word2Vec model. To do this, it requires a simpler mathematical calculation compared to statistical-based methods. When applying the PCA, LDA, Chi-square, and MI FS methods, it is necessary to specify the number of features to be used, and therefore the number of features to be removed. However, the number of features need not be specified for the RARF approach. The RARF method is based on a threshold (between 0 and 1); if the similarity measure between Arabic words exceeds an empirical value of 0.7, the words are grouped together in the same set. PCA can fail when the data is too complex, and it does not work well with data that is highly imbalanced. When the data is highly imbalanced, LDA will not be able to learn much useful information from it. In multi-class datasets, LDA may struggle to separate classes and accurately classify new data points. Also, if the data does not follow a normal distribution, the Chi-square test will not be effective. MI is only suitable for discrete variables, so if the variables are continuous, other methods need to be employed. In such cases, RARF can work well because imbalanced data, discrete variables, multiple classes, and the data distribution are not obstacles for it.

Classifiers
Text classification is well studied in many languages, especially in English [37]. Despite the importance of the Arabic language, little research has been conducted on Arabic text classification using the concept of similarities between Arabic features. Machine learning classifiers are essential for text classification because they provide an automated and powerful way to identify patterns in text documents. In this work, we used three common classifiers, SVM, KNN, and NB, to assess the efficiency of our proposed method. SVMs are a family of machine learning algorithms that solve problems like classification and regression [38]. They divide data into distinct categories using the simplest boundary possible, in order to maximize the distance between the separate groups of data and the boundary that separates them. KNN is a standard classification algorithm that relies on the choice of the distance metric. The idea is the following: from a labeled database, we can estimate the class of a new data point by looking at the majority class of the k closest neighboring data points (hence the name of the algorithm) [39]. The only parameter to set is k, the number of neighbors to consider. The NB classifier is a simple probabilistic classifier based on Bayes' theorem with strong (naive) independence assumptions [40]. It belongs to the family of linear classifiers.
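The majority-vote idea behind KNN can be shown in a few lines. This is a minimal sketch that assumes Euclidean distance and toy two-dimensional vectors; the function name `knn_predict` and the example labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """train: list of (vector, label) pairs; query: vector to classify."""
    # Sort the labeled points by Euclidean distance to the query, keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Return the majority class among the k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ([0.0, 0.0], "sport"),
    ([0.0, 1.0], "sport"),
    ([5.0, 5.0], "economy"),
    ([6.0, 5.0], "economy"),
    ([5.0, 6.0], "economy"),
]
prediction = knn_predict(train, [0.2, 0.1], k=3)
```

In practice the vectors would be the reduced TF-IDF rows of the documents, and k would be tuned on held-out data.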

Data collection
Data collection can be described as a gathering of text-based documents that can be divided into various categories. We used two Arabic benchmark datasets with different numbers and sizes of categories to conduct our studies. The Khaleej-2004 dataset is a commonly used reference for Arabic datasets. It is a collection of 5,690 documents distributed over 4 classes: economy (909 documents), international news (953 documents), local news (2,398 documents) and sport (1,430 documents). Watan-2004 is a large Arabic corpus containing 20,291 documents. Each document is tagged with one of the following six categories: culture (2,782 documents), economy (3,468 documents), international (2,035 documents), local (3,596 documents), religion (3,860 documents), and sports (4,550 documents). The datasets are divided into two parts: the training set contains 80% of the documents, while the test set contains the remaining 20%. The distribution of the training and test sets is given in Table 4.

Performance evaluation
The confusion matrix (also known as the error matrix) is widely used to summarize the performance of a classifier model. It presents the numbers of real and predicted labels. This matrix is a two-dimensional table consisting of two columns and two rows that indicate four meaningful values: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), as shown in Table 5. The common metrics that can be measured from a confusion matrix are accuracy, precision, recall and F-measure. These metrics are used to interpret the results of our method.
− Accuracy: it is the ratio of correct predictions to the total number of predictions. It can be defined by (2).
− Precision: it is the ratio of true positives to all predicted positives. It is also called positive predictive value (PPV). It can be defined by (3).
− F-measure: it is the harmonic mean of precision and recall, and it is given by (5).
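For reference, the standard confusion-matrix definitions of these metrics can be written as below. They are given in their usual textbook form, numbered to match the references in the text; recall is included because the F-measure depends on it.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{2}\\
\text{Precision} &= \frac{TP}{TP + FP} \tag{3}\\
\text{Recall}    &= \frac{TP}{TP + FN} \tag{4}\\
\text{F-measure} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5}
\end{align}
```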

Results and analysis
To make an effective comparison of our approach with the other FS methods, we used three classifiers that work differently: SVM, K-NN, and NB. The results were examined using frequently used evaluation measures: accuracy, precision, recall, and F-measure. For the Khaleej-2004 dataset, Tables 6, 7, and 8 present the results of the five FS methods PCA, LDA, Chi-square, MI, and RARF when the SVM, K-NN, and NB machine learning models are used. The experiments were performed to assess the effectiveness of the proposed RARF approach compared with the other FS methods for the three classifiers.
The performance measures for the SVM classifier are shown in Table 6, which shows that the RARF method outperformed all other feature selection techniques with a maximum accuracy of 94.75%, while Chi-square achieved the second highest accuracy of 94.20%. In addition, the highest precision of 94.07% is obtained by RARF, and the lowest precision of 92.96% is obtained by MI. The highest recall values are 94.59% and 94.26%, obtained by RARF and Chi-square, respectively. RARF also achieves the best F-measure, 94.30%. The Chi-square, MI, and PCA techniques have an F-measure of nearly 93%, while LDA scored the worst recall at 72.79%.
Table 7 shows the performance for the Khaleej-2004 dataset using the K-NN classifier. The scores obtained by RARF are 92.51%, 92.36%, 92.01%, and 92.15% in accuracy, precision, recall, and F-measure, respectively, while MI achieved the second highest accuracy of 85.04%, with a variation of 7.43% compared to RARF. It should be noted that the LDA method achieved the lowest efficiency, close to 77% for all four metrics.
Table 8 summarizes the results achieved by the NB classifier. The first point to mention is that the RARF method achieved the best performance with 88.10%, 88.67%, 87.72%, and 88.09% in accuracy, precision, recall, and F-measure, respectively. However, the NB classifier underperforms compared with SVM and K-NN. MI achieved the second highest performance with 83.30%, 87.54%, 82.53%, and 83.77% for accuracy, precision, recall, and F-measure, respectively. It should also be noted that PCA offers the poorest performance, with a precision of 65.36%.
To summarize, for the Khaleej-2004 dataset, RARF performs better than the PCA, LDA, Chi-square, and MI methods. The highest accuracy, precision, recall, and F-measure are achieved when RARF is applied in conjunction with an SVM classifier, at nearly 94%, followed by K-NN, at nearly 92%.
Tables 9, 10, and 11 present the results for the Watan-2004 dataset. Table 9 shows the accuracy, precision, recall, and F-measure of the FS methods using the SVM classifier. The accuracies obtained by PCA, LDA, Chi-square, MI, and RARF are 92.15%, 67.90%, 92.89%, 92.68%, and 94.01%, respectively. RARF achieves the best precision of 93.60%, while LDA offers the lowest precision of 56.23%. The highest F-measure values are 93.65% and 92.50%, obtained by RARF and Chi-square, respectively. Table 9 also shows that RARF outperformed all the other FS techniques with the highest recall of 93.97%, while Chi-square and MI achieved the next highest recall values of 92.50% and 92.27%, respectively.
Table 10 shows the final scores of the evaluated FS methods with K-NN as the base classifier. We can make several observations from these findings. First, LDA obtained the lowest average performance score in most experiments, with a minimum precision of 54.79%. Second, all the RARF metrics have a score of nearly 87%, which is comparatively better, followed by Chi-square with an accuracy of 85.73%.
Experimental results for the NB classifier are shown in Table 11. The RARF method again proved its effectiveness in the classification process, with scores between 89% and 90%, followed by Chi-square with values close to 84%. We also found that the LDA method consistently delivers the worst results, with scores of less than 67%. For the Watan-2004 dataset, we can conclude that the SVM classifier gives encouraging results compared to the K-NN and NB classifiers. The RARF method generates significant results compared to the PCA, LDA, Chi-square and MI feature selection methods. Additionally, we can note that LDA produces unsatisfactory results.

Figure 1. The main structure of the proposed method

− Recall: it is the ratio of true positives to actual positives. It measures how sensitive a model is to the positive class. It can be defined by (4).
An improved Arabic text classification method using word embedding (Tarik Sabri)
First, we have generated our Word2Vec embedding model from five Arabic datasets with different sizes and classes. This model is able to compute the similarity between Arabic words. The second step involves grouping similar

Table 1. Word2Vec hyperparameters and datasets used to train the model

Table 4. Training set and test set for benchmark Arabic datasets

Table 6. Performance analysis for Khaleej-2004 dataset using SVM

Table 8. Performance analysis for Khaleej-2004 dataset using NB

Table 9. Performance analysis for Watan-2004 dataset using SVM

Table 10. Performance analysis for Watan-2004 dataset using K-NN

Table 11. Performance analysis for Watan-2004 dataset using NB