Text classification supervised algorithms with term frequency inverse document frequency and global vectors for word representation: a comparative study

ABSTRACT


INTRODUCTION
In numerous real-world applications, text classification challenges have been extensively investigated during the past few decades. Recent advances in natural language processing and text mining have piqued the interest of numerous researchers in the creation of applications that utilize text categorization algorithms. These advancements have not only enhanced the accuracy of text classification but also expanded its scope. Text classification models have produced impressive results in tasks such as sentiment analysis, machine translation, and document summarization by combining deep learning approaches with word embeddings such as global vectors for word representation (GloVe). As a result, the opportunities for leveraging text classification continue to grow, promising enhanced automation and information retrieval across a wide range of domains.
Classification of documents is a problem involving the construction of models that can categorize documents into predetermined categories. It is a complicated process that comprises training models, data processing, transformation, and reduction. This remains a noteworthy research area, utilizing numerous strategies and their sophisticated algorithmic combinations. An initial classification of documents into distinct categories simplifies numerous downstream document processing tasks and improves the overall performance of document processing systems. The bulk of document classification algorithms now use text content or document structure to classify documents such as insurance papers, letters, and essays. This work addresses document classification challenges by considering the content of the document rather than its structure.
Selecting the optimal classifier is the most crucial step in the classification of text. We cannot choose the most effective model for a text categorization application until we have a thorough conceptual understanding of each approach. In the next section, the most common supervised text categorization approaches are discussed. First, we will cover non-parametric algorithms that have been explored and applied for classification problems, such as k-nearest neighbor (KNN) [1]. Support vector machine (SVM) [2], [3] is another well-known technique for document categorization that employs a discriminative classifier. This technique has been widely implemented in numerous data mining domains, including image and video processing, among others. In addition, researchers frequently utilize SVM as a benchmark to evaluate the efficacy of their proposed models and to demonstrate their original contributions. Document classification has also been researched using tree-based classifiers such as decision tree (DT) and random forest (RF) [4]. Each of these tree-based algorithms will receive its own segment of discussion. The majority of these methods are applied for document summarization [5] and automated keyword extraction [6]. The purpose of this research is to conduct a comparative analysis of the efficiency and efficacy of various document classification strategies. Even though there are numerous comparison studies and experiments for document categorization, their tests are sometimes "incomplete," as their conclusions are inconsistent due to the use of diverse data sets. We explore the effectiveness, efficiency, and scalability of several document classification techniques.
The paper is structured as follows: in section 2, an overview of feature extraction and classification techniques is presented. Section 3 examines the main issues in text classification and provides a survey of current solutions. Section 4 outlines the generic strategy utilized in the survey, offering insights into the methodologies employed. Section 5 delves into the experimental phase and presents an evaluation of the utilized methods and approaches, discussing their effectiveness and performance. Finally, in section 6, the paper provides a comprehensive summary of the main points discussed throughout the study.

RELATED WORK

Feature extraction
Although the term "word embedding" has gained popularity because of the development of neural network techniques, the first attempts to create distributed representations were made in the context-counting field. The co-occurrence matrix must be explicitly allocated in memory, which is the main disadvantage of context-counting methods. Random indexing [7], [8] was proposed to address this limitation by creating nearly orthogonal random index vectors for words, thereby avoiding an explicit factorization. When dealing with large amounts of text data, however, neural methods such as word2vec and GloVe have proven to be more effective than rule-based approaches. GloVe, a well-known embedding method, has been shown to outperform word2vec in a variety of tasks [9]. GloVe learns word vectors that can be used to reconstruct the likelihood of co-occurrence between words based on their dot product. Both word2vec and GloVe have been used to create massive collections of embeddings that are publicly available.
Table 1 provides a comparison of three text representation models: term frequency-inverse document frequency (TF-IDF), Word2Vec, and GloVe (pre-trained). Although TF-IDF is simple to compute and use for document similarity, it lacks semantic understanding and can be slow with big vocabularies. Word2Vec can capture word semantics but not context-dependent word meaning, and it cannot handle out-of-vocabulary words. GloVe (pre-trained) outperforms Word2Vec in terms of capturing word relationships and meanings.
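As a minimal sketch of the distinction summarized in Table 1, the snippet below (Python with scikit-learn; the example sentences are illustrative assumptions, not drawn from the paper's corpora) shows how TF-IDF treats near-synonymous words as unrelated indexes:

```python
# Minimal sketch: TF-IDF has no notion of word similarity, so two
# near-synonymous reviews overlap only on their shared function words.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the movie was great", "the film was excellent"]
X = TfidfVectorizer().fit_transform(docs)

# "movie"/"film" and "great"/"excellent" occupy separate indexes, so
# the similarity stays well below 1 despite the similar meaning. A
# GloVe-style embedding would place these word pairs close together
# in vector space instead.
print(cosine_similarity(X[0], X[1])[0, 0])
```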

Classification techniques
Boser et al. [10] introduced the SVM, a supervised learning method applicable to classification and regression. SVM was originally developed for binary classification but may be extended to higher-dimensional nonlinear situations [11], [12], and it is based on structural risk minimization. An SVM-based method is presented in [13] that improves the performance of the SVM classifier through incremental learning, harmful unlearning, and boosting. Boosted SVM works particularly well on high-dimensional datasets, while other approaches have improved SVM performance by enhancing vectorization algorithms. According to [14], [15], an augmented naive Bayes vectorization algorithm outperforms the TF-IDF classifier. Laplace smoothing improves naive Bayes-SVM classification performance beyond that of TF-IDF [15]; hence, the suggested approach for categorizing texts is very effective and accurate.
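As a hedged illustration (not the setup of [13]-[15]), a minimal SVM text classifier over TF-IDF features can be assembled in a few lines with scikit-learn; the corpus and hyperparameters here are assumptions for demonstration:

```python
# Minimal sketch of a linear SVM over TF-IDF features; the 20
# newsgroups corpus and default hyperparameters are illustrative
# choices, not the configuration used in the cited studies.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(train.data, train.target)
print(accuracy_score(test.target, clf.predict(test.data)))
```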

K-nearest neighbors (KNN)
KNN is an efficient similarity-based learning algorithm for categorizing documents. It identifies the k nearest neighbors of a test document in the training set and scores candidate classes according to the classes of those neighbors. Iswarya and Radha [16] suggested an ensemble learning strategy for the improved KNN method for text categorization (EINNTC), which uses one-pass clustering to reduce similarity calculation time and minimize noisy samples. In the first stage, a classification model is developed and updated, and in the second stage, ensemble learning is used to determine the ideal value for the parameter K. In terms of F1 score, the results demonstrate that EINNTC surpasses SVM and conventional KNN.
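A minimal sketch of plain similarity-based KNN (the scikit-learn baseline, not the EINNTC ensemble of [16]) follows; cosine distance on TF-IDF vectors is a common choice for documents, and the toy data is an illustrative assumption:

```python
# Minimal sketch of KNN document classification with cosine distance
# over TF-IDF vectors. The toy documents and labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs = ["cheap meds online", "meeting at noon",
        "win money now", "lunch again tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

knn = make_pipeline(
    TfidfVectorizer(),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
knn.fit(docs, labels)
# The test document is classified by a vote among its 3 nearest
# training documents: -> ['spam']
print(knn.predict(["win cheap money"]))
```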

Decision trees (DTs)
Decision trees are regarded as one of the most practical and simple approaches to classification. The technique is built through a hierarchical decomposition of the data space. The decision tree was proposed by D. Morgan and developed by J. R. Quinlan as a classification technique. The main concept is to create a tree of categorized data points based on their attributes. The classifier is a tree in which internal nodes represent features, branches deviating from them represent decision rules, and leaf nodes represent the outcome labels. A decision tree classifies a test document by recursively evaluating the feature values of the document vector at internal nodes until a leaf is reached. The primary problem, however, is deciding which attributes belong at the parent level and which belong at the child level. The most discriminative attributes are selected by applying a metric known as information gain.
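To make the information-gain criterion concrete, a small sketch (toy labels; the standard entropy-based formulation is assumed) computes the gain of one candidate split:

```python
# Toy illustration of information gain: the entropy of the parent
# node minus the size-weighted entropies of the child nodes produced
# by splitting on a feature. Labels here are illustrative.
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

parent = ["spam"] * 5 + ["ham"] * 5   # H(parent) = 1.0 bit
left = ["spam"] * 4 + ["ham"]         # documents where the feature fires
right = ["spam"] + ["ham"] * 4        # documents where it does not

gain = entropy(parent) \
    - (len(left) / len(parent)) * entropy(left) \
    - (len(right) / len(parent)) * entropy(right)
print(round(gain, 3))  # 0.278 bits: how much the split reduces label uncertainty
```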

Random forests (RFs)
Random forests (RFs) are a type of tree predictor introduced by T. Kam Ho in 1995 as an ensemble learning method for text classification. In 2001, Breiman's description of random forests gained attention, influenced by Amit and Geman's similar "random trees" methods. Random forests are widely used due to their high predictive accuracy and have been successfully applied in various fields [17]-[22]. In 2018, a new variation called LazyNN RF was proposed for high-dimensional noisy classification applications. The model improves on typical random forests by using a "localized" training projection that filters out unnecessary data, avoiding the overfitting caused by overly complex trees. LazyNN RF outperformed state-of-the-art classifiers in almost all reference datasets tested, demonstrating its effectiveness and feasibility as a strategy [22].
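A minimal scikit-learn random forest over TF-IDF features (the standard estimator, not the LazyNN RF variant; parameters are illustrative assumptions) looks as follows:

```python
# Minimal sketch of a random-forest text classifier: each tree is
# grown on a bootstrap sample and splits on random feature subsets;
# the forest predicts by majority vote. Parameters are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

rf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0),
)
# Usage mirrors the SVM sketch above:
# rf.fit(train.data, train.target); rf.predict(test.data)
```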

Classification techniques comparison
In the context of large-scale search problems, as illustrated in Table 2, the effectiveness of the KNN algorithm is constrained by data storage limitations. Moreover, the efficacy of KNN is highly dependent on the definition of a meaningful distance function, making it a highly data-dependent algorithm, as demonstrated by previous research [23], [24]. These observations highlight the critical considerations associated with the practical application of KNN in scenarios where storage resources and the definition of pertinent distance metrics play a pivotal role in determining the algorithm's success.
Since its introduction in the 1990s, SVM has been one of the most effective machine learning algorithms. However, SVMs are hindered by the lack of transparency in their conclusions, which is a result of their high dimensionality. Consequently, the company score cannot be expressed as a parametric function of financial indicators or in any other functional form [25]. A variable financial ratio rate is a further limitation [26]. The decision tree is a rapid method for both learning and prediction, but it is particularly sensitive to small data changes and easily overfits [27]. Out-of-sample prediction is also a difficulty with this method. Compared to other systems, random forests are extremely quick to train, but once trained, they are slow at making predictions [28]. The SVM classifier gave better results in terms of precision, recall, and F-measure compared to DT [29], [30].

STATE-OF-THE-ART TECHNIQUES
Table 3 (see the appendix) summarizes key aspects of four research articles addressing text classification techniques, including the method used, the review element, the key contribution, and the corpus utilized by each methodology. The first article introduces a boosted SVM classifier using incremental learning and detrimental unlearning to address challenges related to SVM convergence and memory consumption in high-dimensional datasets. The second article discusses multi-class document classification using a support vector machine based on an improved naïve Bayes vectorization technique, aiming to reduce the dimensionality of the data while enhancing vectorization methods. The third article presents adaptive random forests for evolving data streams, proposing a technique that adapts random forests for dynamic data stream learning. The final article introduces a LazyNN RF classifier designed for high-dimensional noisy classification tasks and demonstrates its superior performance compared to state-of-the-art classifiers in various reference datasets. Each article contributes unique approaches to addressing specific challenges in text classification, and they utilize different datasets to validate their methods.

METHODOLOGY OF STUDY
We intend to provide an overview of text classification techniques in this article, along with an explanation of the relevant pre-processing steps and evaluation methods, following the workflow in Figure 1. First, we begin with text preparation and go over the various techniques available, followed by a review of text representation, which is typically the most difficult issue in building a classifier. Phase 2 presents document representation, and in the last part we review and evaluate the different classification methods on the four corpora.

Text preprocessing
Text cleaning and pre-processing are crucial steps for improving the performance of text categorization. This stage involves removing unnecessary and nonsensical terms from the data. In our evaluation, each dataset underwent the following procedures: elimination of punctuation and numerals, as well as the removal of stop words. Additionally, tokenization is another essential pre-processing step, which breaks down a text into smaller units called tokens. Tokens can be words, sentences, or other significant parts of the text. The main goal here is to ensure that sentences are correctly processed. Text documents often contain common but uninformative words like "before," "the," "after," and "a." These words are typically removed from text documents to improve analysis accuracy. Finally, stemming and lemmatization are employed to handle different forms of words while preserving their semantic meaning. This technique helps reduce the feature space by merging various word forms into a common representation, ultimately aiding text classification.
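The pipeline below sketches these steps in Python; NLTK is an assumed tooling choice, as the paper does not name its preprocessing library:

```python
# Sketch of the preprocessing stages described above: lowercasing,
# removal of punctuation/numerals, stop-word filtering, lemmatization.
# NLTK is an assumed library choice, not named in the paper.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

_stops = set(stopwords.words("english"))
_lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())    # drop punctuation and digits
    tokens = text.split()                            # simple whitespace tokenization
    tokens = [t for t in tokens if t not in _stops]  # remove uninformative words
    return [_lemmatizer.lemmatize(t) for t in tokens]  # merge word forms

print(preprocess("The reviewers watched 2 movies before the screenings."))
# -> ['reviewer', 'watched', 'movie', 'screening']
```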

Text representation

Term frequency-inverse document frequency
Jones [31] developed the inverse document frequency (IDF) technique to reduce the influence of frequently used words in a corpus in conjunction with term frequencies. IDF assigns higher weight to words that appear in few documents and lower weight to words that appear in many. When combined with term frequency (TF), this yields the document's term frequency-inverse document frequency (TF-IDF) weighting. Although IDF attempts to address the issue of common terminology in documents, this approach has limitations. Because each word is represented independently as an index, TF-IDF ignores word similarity within the document. In recent years, however, new methods with more complex models, such as word embeddings, which can incorporate notions such as word similarity, have been introduced.
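For concreteness, the classic weighting can be written as follows (a standard formulation; the paper does not state the formula itself, and library implementations such as scikit-learn apply smoothed variants):

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the number of occurrences of term t in document d, N is the total number of documents in the corpus, and df(t) is the number of documents containing t.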

Word embedding: GloVe
Word embedding is a category of feature-learning techniques that maps each word or phrase in a lexicon to a real-valued N-dimensional vector. Numerous word embedding approaches have been developed to turn unigrams into inputs appropriate for machine learning models. Word2Vec and GloVe are two of the most prevalent and successful of these approaches.
GloVe is a robust word embedding technique that has been used for text document classification [9]. In this method, words are likewise represented as high-dimensional vectors and trained on the co-occurrences of neighboring words in a large corpus. Pre-trained word embeddings are used in many works and are based on a 400,000-word vocabulary trained on Wikipedia 2014 and Gigaword 5. In this study, word representation is performed using 50 dimensions. GloVe also provides pre-trained word vectorizations with 100, 200, and 300 dimensions.
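A minimal sketch of using such pre-trained vectors averages the word vectors of a document; the file name follows the public glove.6B release, and the local path is an assumption:

```python
# Sketch: load pre-trained 50-dimensional GloVe vectors and build a
# document vector by averaging word vectors. The path assumes a
# local copy of the public glove.6B release.
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.split()
            vectors[word] = np.asarray(values, dtype="float32")
    return vectors

def doc_vector(tokens, vectors, dim=50):
    known = [vectors[t] for t in tokens if t in vectors]  # skip OOV words
    return np.mean(known, axis=0) if known else np.zeros(dim, dtype="float32")

glove = load_glove()
print(doc_vector("the film was excellent".split(), glove).shape)  # (50,)
```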

EXPERIMENT AND EVALUATION
In this section, we compare each of the strategies and algorithms. In addition, we investigate the flaws of current categorization strategies and evaluation methodologies. The purpose is to select an efficient classification technique while understanding the similarities and differences between existing systems.

Dataset
Text categorization corpora are collections of texts that have been classified into distinct categories or subsets. Annotated datasets, which contain labeled text document samples, have expedited the expansion of this subject. We investigate the domain-specific characteristics of the four datasets included in this study. Table 2 provides a summary of the datasets by category, average phrase length, dataset size, related publications, data sources, and expected applications. By evaluating these datasets, we gain a greater understanding of text categorization issues and opportunities, which can enhance classification techniques and tools for several applications.
− IMDB: 25,000 IMDB film reviews, categorized by sentiment (positive/negative). Following pre-processing, each review is encoded as a sequence of word indexes (integers). For instance, the number "3" represents the third most common term in the data.
− Reuters-21578: 11,228 newswires from Reuters, categorized under 46 themes. It is a multi-class, multi-label dataset comprising 90 total classes, 7,769 training documents, and 3,019 testing documents.
− 20 newsgroups: roughly 18,000 newsgroup posts on 20 themes, separated into training and testing subsets. The split between the train set and the test set is determined by messages posted before and after a given date.
− Web of science: 11,967 documents classified into 35 categories, including seven parent categories.
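The index-encoded IMDB form described above matches the packaging offered by Keras; a possible loader (the paper does not name its own) is sketched below:

```python
# Sketch of loading IMDB in the word-index encoding described above;
# the Keras packaging is an assumption, not the paper's stated loader.
from tensorflow.keras.datasets import imdb

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(len(x_train), len(x_test))   # 25000 25000
print(x_train[0][:8], y_train[0])  # word indexes plus a 0/1 sentiment label
```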

Extraction methods
After preprocessing the data, the TF-IDF extractor from the scikit-learn toolbox is used to vectorize the texts for input into the classifiers [28]. In a similar fashion, a pre-trained GloVe [32], [33] model is utilized to construct the GloVe feature extractor by averaging the vectorized word representations of the words in the document. The GloVe model was trained on data from Wikipedia and Gigaword 5 [9], with 6 billion tokens and 400,000 words in its lexicon [28]. This technique captures both semantics and context without requiring N-grams to assess the input. This article aims to offer a thorough introduction to text categorization approaches, including preprocessing procedures, assessment methodologies, and a comparison of various algorithms and strategies. In addition, we explore the limits of current classification and assessment strategies and emphasize the difficulties in selecting an efficient classification system by comprehending the similarities and differences between existing systems throughout the pipeline phases. Two tests were performed, each with a different feature extraction approach, and four ML classifiers were used. All tests were carried out on an Intel Core i5-6500 CPU with 16 GB of RAM.
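In outline, the two feature extractors can be set up as below (a sketch; the example texts are placeholders, and `load_glove`/`doc_vector` refer to the GloVe snippet in the word-embedding section):

```python
# Sketch of the two extractors used in the experiments: TF-IDF over
# the training vocabulary, and mean-pooled pre-trained GloVe vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["a gripping, well acted film", "dull plot and wooden acting"]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(train_texts)  # sparse [n_docs, n_terms]

# glove = load_glove("glove.6B.50d.txt")    # assumes the earlier helpers
# X_glove = np.vstack([doc_vector(t.split(), glove) for t in train_texts])
```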

Experiment 1
The first experiment was carried out using the TF-IDF feature extraction approach prior to applying the ML algorithms. Table 4 displays the accuracies obtained by the classifiers, with the best accuracy highlighted in bold. According to the findings of experiment 1, SVM, KNN, and RF all yield high accuracies of more than 80%. Table 4 shows the classification scores when utilizing the TF-IDF extraction technique and clearly indicates that the SVM classifier outperforms the other classifiers with this approach. When utilizing TF-IDF, the SVM classifier achieves four of the top assessed scores.
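A sketch of the experiment-1 protocol feeds the same TF-IDF features to the four compared classifiers; the corpus choice and classifier settings below are illustrative assumptions, not the paper's tuned configurations:

```python
# Sketch of the experiment-1 protocol: identical TF-IDF features fed
# to the four compared classifiers. Hyperparameters are assumptions.
from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")
vec = TfidfVectorizer(stop_words="english")
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)

classifiers = {
    "SVM": LinearSVC(),
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="cosine"),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X_train, train.target)
    print(name, round(accuracy_score(test.target, clf.predict(X_test)), 3))
```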

Experiment 2
When using the GloVe [34] extraction approach, the SVM and KNN classifiers perform equally well, as shown in Table 5. It is notable, however, that on the IMDB dataset, the random forests classifier emerges as the top performer across all metrics evaluated. This observation highlights the dataset-specific nuances that can impact classifier effectiveness. While SVM and KNN remain competitive in the majority of instances, the IMDB dataset presents a unique challenge in which the random forests classifier consistently demonstrates its efficacy across multiple evaluation criteria. This insight emphasizes the significance of selecting an appropriate embedding technique and classifier based on the specific characteristics of the dataset under consideration, as this decision can have a substantial impact on classification outcomes.

Discussion
Figures 2 and 3 show that the maximum accuracy on the Reuters dataset is 90 percent, according to the best accuracy of each approach indicated in Figure 2. TF-IDF consistently beats word embedding in most models, according to our observations. This finding might be due to several factors. Word embeddings are unable to generate links between newly occurring words and use them for training, owing to the fixed vocabulary and associations in GloVe word embeddings. TF-IDF, on the other hand, builds vectors using the whole vocabulary available in the training data. Overfitting is also a common issue when using word embeddings. Because word embedding is a complex type of word representation (in addition to the limited vocabulary), it is quite conceivable that the training data is over-fitted in our experiment. Another downside of using complex word representations is that they contain more hidden information, which is of little use in our case; nevertheless, the results show that word embeddings exploit links between words to achieve better precision in the case of random forests.

CONCLUSION
In recent years, text classification has risen in prominence, resulting in the application of numerous data mining methods to the text domain. The performance of many of these methods is hindered by the presence of high-dimensional features and hidden meanings in text data. All of the methods presented in this article have advantages and disadvantages, and selecting the optimal classifier for the task is essential for good classification performance. A combination of adequate classifier selection and a dimensionality reduction technique would surely improve the classification outcome.
Text categorization is a major challenge in machine learning, especially as text and document datasets grow. To address this issue, it is critical to create and disseminate supervised machine learning methods, particularly for text categorization. Existing algorithms must be evaluated to improve existing document classification systems. Nonetheless, improving existing text classification algorithms requires a better understanding of feature extraction methods and how to evaluate them accurately. In both academic and commercial applications, TF-IDF, TF, and GloVe are extensively used feature extraction techniques. In this study, we discussed classic supervised techniques. In addition, text and document cleaning can increase an application's correctness and robustness, and we examined the essential pre-processing techniques for text. Existing classification approaches such as KNN, SVM, DT, RF, and conditional random fields (CRF) are the primary focus of this study. Accuracy and precision evaluation methodologies were applied to measure performance. Using these metrics, text classification algorithms can be evaluated.
This article concludes with a summary of recent developments in supervised techniques and the evolution of text categorization algorithms. It highlights the continuous progress in harnessing machine learning methods to enhance the accuracy and efficiency of text classification tasks. In an upcoming article, our focus will shift toward deep learning algorithms, exploring their most recent developments in the field of natural language processing. Additionally, we will conduct a comparative analysis of these deep learning techniques, evaluating their performance when paired with traditional text representation methods like TF-IDF and GloVe.

Figure 1. Methodology and workflow of the present paper

Table 1. Comparison of feature extraction methods

Table 2. Comparison of text categorization algorithms (SVM, KNN, DT, and RF)

Table 4. The performance (precision, recall, F-measure (F1)) and accuracy of the different classification algorithms using TF-IDF vectorization techniques

Table 5. The performance (precision, recall, F-measure (F1)) and accuracy of the different classification algorithms using GloVe vectorization techniques

Table 3. Text categorization techniques comparison using the following criteria: strategy used, review element, key contribution (novelty), and corpus of each methodology