An intelligent auto-response short message service categorization model using semantic index

ABSTRACT


INTRODUCTION
The advancement of communication technology and the extensive use of mobile phones have made the short message service (SMS) one of the leading modes of communication because of its low cost, ease of use, and fast delivery. To send and receive SMS, both the sender and the receiver must be within network coverage; no internet connection is required. Owing to its popularity and affordability, SMS is widely used by government organizations, banks, and businesses to communicate information to, or notify, their users or clients. According to earlier research, the categorization of emails and the categorization of SMS are not alike. Emails are provisioned with sophisticated techniques [1] that differentiate between messages by using content and metadata. SMS spam filtering, on the other hand, is not yet as potent, because most techniques that label SMS [2]-[4] carry the limitation of performing feature engineering manually, and this manual feature extraction is considered an inefficient approach. Applying traditional machine learning techniques requires good knowledge of the problem domain. Without this prior knowledge, designing and selecting features is considered a difficult task [5].

The quality of a machine learning model depends not only on the dataset but also on how efficiently the patterns are encoded in the features of the data. Identifying appropriate features demands a good grip on the domain and prior expert knowledge. The selected features must be reexamined based on the feature importance graph and information gain. Only then is it possible to separate the features that are useful for classification from those that are not. Iteratively examining the features by this trial-and-error method is an arduous and time-consuming process [6]. One possible way to eliminate this inefficiency is automated feature extraction.
Most research on SMS classification has addressed binary classification (i.e., into ham and spam). Spam refers to irrelevant messages [7]. Ham labels general communication among people that contains the desired information. This categorization has been further extended by classifying SMS data into multiple classes, mostly with machine learning algorithms. SMS classification is a long-standing problem, and numerous studies have addressed it using various algorithms and datasets.
Various machine learning algorithms [8], [9] are used to build an SMS classification model. They discussed in depth different classifiers, viz. support vector machine (SVM), naïve Bayes (NB), decision tree (DT), logistic regression (LR), random forest (RF), AdaBoost, artificial neural network (ANN), and convolutional neural network (CNN). The experimentation was done on different datasets, namely "SMS spam collection V.1", "SMS spam collection database", and "Spam SMS Dataset 2011-12" from Kaggle, for training and testing the model. In [9], the work is divided into 4 phases: experimental database, feature creation, feature selection (CV), and comparison of algorithms. A deep learning model is used to address the same problem [10]. They used a convolutional neural network classifier to train and test the model with "Tiago's dataset". Their model consisted of two convolutional layers, each with 32 filters and a kernel of size 3; the activation function was the rectified linear unit (ReLU), and the max pooling had a pool size of 2.
They suggested future research on improving the architecture of the CNN. A new method, the GentleBoost classifier (a boosting classifier) [11], is introduced to overcome the issues of computation and space complexity. They trained and tested the model with "Tiago's dataset". The model consisted of several steps, which include tokenizing the message followed by feature extraction. On comparing their model with naïve Bayes, they concluded that GentleBoost required less space and had lower computational complexity. In [12], a model is built with different machine learning algorithms, such as naïve Bayes, support vector machine, and maximum entropy classifier, on the "SMS spam collection" dataset. Their model mainly consists of two steps: tokenization and formation of the DocumentTermMatrix. A deep learning model on "UCI SMS Spam" using the recurrent neural network (RNN) algorithm is proposed [13] for identifying ham and spam messages. They considered different parameters for their model, such as batch size, max sequence length, RNN size, embedding size, min word frequency, learning rate, training data, and test data. They examined a particular training algorithm and the cycle of applying it to all training vectors, which is known as an epoch. A model on the "UCI SMS Spam" dataset is built and validated with different deep learning and machine learning algorithms [14]. They used naïve Bayes, logistic regression, k-nearest neighbors, decision tree classification, support vector machine, random forest, and long short-term memory (LSTM) classifiers. An artificial neural network classifier is used in [15]. The model accuracy is validated on the "UCI benchmark dataset" with deep learning and machine learning algorithms [16]. They used naïve Bayes (NB), random forest (RF), gradient boosting (GB), logistic regression (LR), stochastic gradient descent (SGD), convolutional neural network, and long short-term memory. Their model includes multiple steps for the creation of a word matrix, feature extraction, and classification. The problem is addressed with multinomial naïve Bayes (MNB), support vector machine (SVM), random forest (RF), and AdaBoost classifiers [17]. The message dataset is collected from volunteers and students attending Thapar University [18]. A deep analysis of the machine learning and deep learning algorithms used to address the same problem concluded that SVM and NB are the most widely used, but that the cost of SVMs grows with the size of the dataset and that, because SVMs are not based on probability, they are not suitable for the classification of labels. The performance of the model is improved with optimization techniques; particle swarm optimization [19], artificial bee colony [20], bat algorithm [21], bees algorithm [22], cat swarm algorithm [23], and approximate muscle guided beam search [24] are some of the recently proven algorithms for classification problems. A step-by-step process for building an SMS classification model on "UCI SMS Spam" with a word embedding technique, term frequency-inverse document frequency (TF-IDF), is presented in [25]. They used 5 different machine learning algorithms in this model: multinomial naïve Bayes (MNB), k-nearest neighbor (KNN), SVM, DT, and RF. The random forest algorithm achieved the highest accuracy among all the algorithms. The main drawback of this method is that the dataset is imbalanced, so accuracy alone is not enough to evaluate performance. A variety of deep learning algorithms, such as LSTM and gated recurrent unit (GRU) [26], are used, and the results of the proposed model are compared with machine learning algorithms such as SVM and NB proposed by Almeida and Hidalgo. An analysis on the "SMS spam collection v.1" dataset [27] reports that the naïve Bayes classifier outperforms other classification algorithms in classifying spam and ham messages. A response model for SMS [28] was developed with alert 1 (does not need immediate attention), alert 2 (requires attention within the day), and alert 3 (requires immediate attention). They used the NB algorithm for the classification of SMS. A binary classification of SMS into spam and ham messages [29], by the SVM algorithm on the "UCI SMS SPAM dataset", is proposed with a fusion of natural language processing (NLP) techniques for pre-processing. An extended multi-class model using the logistic regression algorithm for classifying SMS into regular, info, ads, and fraud is developed [30]-[34]. It is implemented on a real-time dataset collected from multiple volunteers. The performance is compared with different algorithms such as KNN, DT, and NB. Table 1 shows the summary of the related works discussed above.
The rest of the manuscript is organized into the following sections: Section 2 explains the design and implementation of the proposed SMS auto-response classifier with model refinement parameters. Section 3 describes the methods involved in the auto-response SMS categorization model, focusing on content preprocessing and on content-based and semantic-based feature extraction. Section 4 presents the results and discussion, with model performance and validation on real-time data.

Table 1. Summary of related works
Reference | Dataset | Algorithm(s) | Accuracy
… | NA | NB | 89%
Jain et al. [29] | UCI SMS Spam | SVM | 98.7%
Dewi et al. [30] | Dataset from seven volunteers | LR, NB, KNN, DT | 97.5% (LR)

SMS AUTO-RESPONSE CLASSIFIER
The auto-response classifier uses a multi-layer perceptron (MLP), a feedforward neural network (FNN) model. The MLP model is built from three kinds of layers: an input layer, one or more hidden layers, and an output layer. The first layer, the input layer, is responsible for acquiring the input data (SMS) that must be classified. After the input layer comes an arbitrary number of hidden layers that make up the computational engine of the multi-layer perceptron model. The output layer is the final layer, which performs the classification and labeling of the SMS into its respective class/category. The data flows from the input layer to the output layer (i.e., forward propagation), which is why the MLP model is considered a feed-forward network. The nodes (neurons) of the model are capable of approximating continuous functions and then performing tasks such as pattern recognition, prediction, and classification. Next, feature extraction is performed on the training and testing datasets. The word vector is created from the training dataset, after which two word count matrices are created by transforming the training and testing parts of the dataset. The next step is to train the MLP model. A multi-layer perceptron has 3 different parts: an input layer with one neuron (or node) for each input, an output layer with a single node for each output, and any number of hidden layers, each of which can have any number of nodes. When it comes to the classification of SMS into multiple classes, the word count matrix obtained from the feature extraction step is passed to the MLP model as input; each column of this matrix corresponds to a node of the input layer, and each node of the input layer is provided with a random initial weight. These weights, together with bias values, help determine the output value of each node. The output of a node is generally some real value, but for classification we need a binary result (i.e., whether to consider the output of the node, 1, or not, 0); for this purpose we use the activation function. The activation function decides whether to activate a neuron or not. Here the bias is used to shift the curve of the activation function left or right, based on which the neuron gets activated or deactivated. Figure 1 represents the complete process that occurs within a single node to produce an output. The mathematical function of an individual node is represented in (1). Several mathematical functions are used in this experimentation as activation functions. The identity activation function is a simple linear function that returns the same value as its input. The mathematical representation of the identity activation function is given in (2).
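The single-node computation described above can be sketched in a few lines. Equation (1) is not reproduced in this text, so the sketch assumes the usual form, a weighted sum of the inputs plus a bias passed through an activation function; the step activation, the weights, and the word counts are illustrative values, not taken from the paper.

```python
def node_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


def step(z):
    """Binary decision: 1 if the node is activated, 0 otherwise."""
    return 1 if z > 0 else 0


x = [2, 0, 1]            # hypothetical word counts for one message
w = [0.5, -0.3, 0.25]    # hypothetical weights

# The weighted sum is 1.25; changing only the bias flips the decision.
active = node_output(x, w, bias=-1.0, activation=step)    # z = 0.25
inactive = node_output(x, w, bias=-2.0, activation=step)  # z = -0.75
print(active, inactive)
```

With the same inputs and weights, only the bias differs between the two calls, which illustrates how the bias shifts the point at which the node switches on.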
The ReLU activation function is a commonly used activation function in neural network models. It is a nonlinear function that applies a threshold operation to the input, setting any negative values to zero and leaving positive values unchanged. The mathematical equation of ReLU is given in (3).
The logistic activation function is commonly used in classification problems. It is a sigmoid function that maps any real-valued input to the range (0, 1). The mathematical equation for the logistic activation function is given in (4).
The hyperbolic tangent (tanh) activation function is a commonly used activation function in neural networks. It is a sigmoid-shaped function that maps any real-valued input to the range (-1, 1). The mathematical formula for the tanh activation function is given in (5).
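As a compact reference, the four activation functions described above can be written out as follows. Equations (2) to (5) are not reproduced in this text, so the sketch uses the standard textbook definitions, which match the verbal descriptions; the function names are chosen here for illustration.

```python
import math


def identity(z):
    """f(z) = z"""
    return z


def relu(z):
    """f(z) = max(0, z): negative inputs become zero."""
    return max(0.0, z)


def logistic(z):
    """f(z) = 1 / (1 + e^(-z)): output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    """f(z) = (e^z - e^(-z)) / (e^z + e^(-z)): output lies strictly between -1 and 1."""
    return math.tanh(z)


for z in (-2.0, 0.0, 2.0):
    print(z, identity(z), relu(z), round(logistic(z), 4), round(tanh(z), 4))
```

For moderate inputs such as these, the logistic and tanh outputs approach but never reach the ends of their ranges.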
The results from the classifier vary with the number of hidden layers used and the size of each hidden layer. By default, only one hidden layer of size 100 is used; max_iter, the number of epochs for the classifier, defaults to 200; and the default activation function is the rectified linear unit (ReLU) [21]. The accuracies obtained under different hyperparameter settings are discussed in later sections.
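To make the role of the hidden layer sizes concrete, the following is a minimal sketch of forward propagation through an MLP whose shape is given by a tuple of hidden layer sizes. It is not the authors' implementation: the random weight range, the softmax output layer, and the names `init_layers` and `forward` are assumptions made for illustration, and no training is performed.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def softmax(scores):
    """Turn raw output scores into class probabilities (assumed output layer)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_layers(n_inputs, hidden_layer_sizes, n_outputs, seed=0):
    """Create one (weights, biases) pair per layer with small random weights."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Propagate one input vector from the input layer to the output layer."""
    for i, (weights, biases) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        is_output = i == len(layers) - 1
        x = softmax(z) if is_output else [relu(v) for v in z]
    return x


# Default shape described above: a single hidden layer of size 100.
layers = init_layers(n_inputs=6, hidden_layer_sizes=(100,), n_outputs=5)
probs = forward(layers, [1, 0, 2, 0, 0, 1])   # hypothetical word counts
print(len(probs), round(sum(probs), 6))
```

Passing a longer tuple such as `(150, 100, 50, 25)` to `init_layers` yields a deeper network without any other change to the code.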

METHOD
The proposed auto-response SMS categorization model is built to predict multiple classes based on the semantics (information) embedded in each message. In this approach, each SMS message is classified into one of the pre-defined categories: ham, spam, info, transactions, and one-time passwords (OTPs). The process includes identification of the SMS dataset, formatting and preprocessing of the dataset, semantic-based feature extraction, and auto-response SMS classification using MLP on both training and testing data. The model outputs an SMS class after validation with unlabeled, real-time data. The model is built using the MLP algorithm, with at least three layers: the input layer, the hidden layer(s), and the output layer. Each layer is a collection of perceptrons (nodes). Every node is associated with an activation function that determines the node to which the output of the current node must be transferred. The entire dataset is first preprocessed and formatted by editing the column names and adding extra columns if necessary. The formatted data is then split into training data and testing data. As the names suggest, training data is used to train the MLP model, whereas testing data is used to test the trained model. Feature extraction follows: the strings (SMS text) are converted into vector format, which is used as input to the MLP model. The MLP model applies bias values and activation functions at the nodes to determine the class to which the SMS belongs. Each SMS is classified into exactly one of five main classes: ham, spam, info, transactions, and OTPs. The entire process flow is shown in Figure 2.
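The split into training and testing data can be sketched as a shuffled index split. The sketch uses the dataset size of 7,398 messages and the test size of 0.2 reported in the dataset statistics; the shuffling, the seed, and rounding the test count up are assumptions. Rounding up gives 1,480 test messages, which agrees with the number of messages covered by the confusion matrix in the results (347 + 343 + 96 + 367 + 327).

```python
import math
import random


def train_test_split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle the row indices and reserve a fraction of them for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)   # rounding up is an assumption
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(7398)
print(len(train_idx), len(test_idx))
```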

Dataset
The experiment uses the multi-class SMS dataset [20]. The dataset consists of 5 categories of messages: ham (1,729), spam (1,772), alerts (465), OTPs (1,625), and transactions (1,807), for a total of 7,398 messages. The unlabeled data used in this study is a collection of real-time messages obtained from students at the Institute of Aeronautical Engineering (IARE). It comprises 100 messages, 20 from each class of SMS, including personal messages, multiple transactions, OTPs for various purposes, spam messages, and SMS from various shopping sites. The unlabeled data is used to test the model's performance.

Data preprocessing
In this phase, the original data is cleaned and reformatted with the selected feature set. In data cleaning, punctuation characters such as !"#$%&'()*+,-./:;<=>?@[\]^'{} are removed, and then all stop words are removed from the messages. After removing stop words and punctuation, all messages in the dataset are converted to lowercase so that every word is in the same letter case. Once data cleaning is complete, a clean and noise-free dataset is available. Data formatting is then performed on the cleaned data through transformation, renaming of existing columns, and derivation of new features. The message "type" column is renamed "label No" and the "message" column is renamed "text". The newly derived features are the "category" and "length" of each message. The "category" is derived from the type of the message based on the "label No" value, i.e., 0: ham, 1: spam, 2: alerts, 3: transactions, 4: OTPs. The "length" column holds the length of each message in the dataset. The newly created columns are used for analysis, creating training vectors, and visualizations. Table 2 shows the dataset after data formatting.
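The cleaning and formatting steps above can be sketched as follows. Several details are assumptions: the stop-word list is a small hypothetical stand-in, Python's `string.punctuation` is used as the punctuation set, the text is lowercased before stop-word filtering so that capitalized stop words are also matched (the description lists lowercasing last), and "length" is computed on the cleaned text.

```python
import string

# Hypothetical stop-word list; the paper does not state which list was used.
STOP_WORDS = {"a", "an", "the", "is", "to", "for", "your", "of", "and", "do", "not"}

CATEGORIES = {0: "ham", 1: "spam", 2: "alerts", 3: "transactions", 4: "OTPs"}


def clean_text(message):
    """Strip punctuation, lowercase, and drop stop words."""
    no_punct = message.translate(str.maketrans("", "", string.punctuation))
    words = [w for w in no_punct.lower().split() if w not in STOP_WORDS]
    return " ".join(words)


def format_row(label_no, message):
    """Build one formatted record with the renamed and derived columns."""
    text = clean_text(message)
    return {
        "label No": label_no,
        "text": text,
        "category": CATEGORIES[label_no],
        "length": len(text),
    }


row = format_row(4, "Your OTP is 482913. Do not share it!")
print(row)
```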

Feature extraction
Feature extraction is the process of converting raw data into numerical values that can be processed by a classifier. This feature extraction process ensures that there is no loss of information from the raw data. In our case, the raw data is the available dataset of messages, which are strings. This raw data is not accepted by the ML model and cannot be used directly for training and testing. Hence the numerical features must be extracted from the raw data. The numerical features from the messages are the frequency count of each word in the dataset and the label No values (0: ham, 1: spam, 2: alerts, 3: transactions, 4: OTPs). Feature extraction is done in two steps: creation of the semantic-based word vector and transformation of the text document. In step 1, the semantic-based word vector is created as a map of words in which each word in the dataset or text document is a key and the index of that word, based on the ordering of the words, is the value; before preparing this word vector, stop words are removed from the dataset or text document. In step 2, the semantic-based word vector obtained in step 1 is used to transform the text document or dataset into a matrix in which each column is a word from the text document, each row represents a particular sentence or text sample, and each cell holds the count of that word in that sentence. These matrices are then provided to the model for training and classification, as shown in Figure 2.
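The two steps above amount to building a vocabulary map and then counting words per message. The sketch below is one way to do this with plain Python; the alphabetical ordering of the vocabulary, the function names, and the sample messages are assumptions, and stop words are assumed to have been removed already.

```python
def build_word_vector(documents):
    """Step 1: map each distinct word to a column index (alphabetical order assumed)."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(vocabulary)}


def transform(documents, word_vector):
    """Step 2: one row per message, one column per word, cells hold word counts."""
    matrix = []
    for doc in documents:
        row = [0] * len(word_vector)
        for word in doc.split():
            index = word_vector.get(word)
            if index is not None:       # words unseen during training are ignored
                row[index] += 1
        matrix.append(row)
    return matrix


# Hypothetical, already-cleaned messages.
train = ["otp code verification code", "account debited balance available"]
test = ["otp code expired"]

word_vector = build_word_vector(train)        # built from the training data only
train_matrix = transform(train, word_vector)
test_matrix = transform(test, word_vector)
print(word_vector)
print(train_matrix)
print(test_matrix)
```

Building the word vector from the training part only, and then transforming both parts with it, mirrors the description of creating the word vector on the training dataset and producing two word count matrices.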

SMS categorization model
The SMS categorization model is constructed with a multi-layer perceptron by considering both content-based and semantic-based features. The response of the model is one of 5 class labels, viz. ham, spam, alert, transaction, and OTP. The proposed auto-response SMS categorization model is evaluated using precision, recall, and F1 score.

RESULT AND DISCUSSION
The proposed SMS auto-response model is built using an MLP classifier on the multi-class SMS dataset. The model is constructed with a minimal number of layers. The activation function used is ReLU, and the model is iterated for up to 6,000 epochs to find the optimal response.

Dataset statistics
The dataset comprises a total of 7,398 messages, including 1,729 ham messages, 1,772 spam messages, 465 alert messages, 1,807 transaction messages, and 1,625 OTP messages. The testing and training data are detailed in Table 3. The formatted SMS data is split into two segments: testing and training. In this experiment, the test size parameter is set to 0.2, which indicates that 20% of the dataset messages are used for testing and the remaining 80% are used for training the MLP model. Figure 3 visualizes the semantics involved in the various categories of SMS messages as word clouds, which depict the most repeated words of each class. The analysis of the word clouds reveals that certain words, such as "number" and "person", are predominantly repeated in the ham class, which mainly consists of messages related to general communication. In contrast, the spam class contains numerous words that are repeated with similar frequency, such as "date" and "organization", in messages that are generally fraudulent. The alerts class consists of words like "dear", "customer", and "time", which are related to providing information to customers, while the transaction class is associated with words like "duration", "ac", "number", and "debited" in messages related to financial transactions. Finally, the OTP class is distinguished by the frequent repetition of words like "OTP", "code", and "verification". More details about the prominent semantics involved in each category are given in Table 4.

Category | Description | Prominent words
Ham | … | "person," "come," "number," "ok," "date," and "time"
Spam | Does not serve any meaning | "organization," "tc apply," "money," and "date"
Alert | Sent by companies to contact their customers | "dear," "customer," "number," "recharge," "successfully," and "time"
Transaction | Involve bank transactions where money is debited or credited | "ac," "number," "debited," "credited," "balance," and "available"
OTP | Used for verification or validation purposes | "otp," "code," "verification," "number," "organization," "valid," and "verify"

SMS auto-response model accuracy
The MLP classifier achieved its highest accuracy of 97% when max_iter is 3,000, the activation is "ReLU", and the hidden layer sizes are 150, 100, 50, 25. A confusion matrix is then plotted; the confusion matrix in Figure 4 shows the number of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) produced by the classification model. These values are used for calculating the accuracy of the model.

From Figure 4, it is identified that out of 347 ham messages, 338 are identified as ham, 5 as spam, 4 as alerts, and 0 as transactions or OTPs. Out of 343 spam messages, 324 are identified as spam, 8 as ham, 10 as alerts, 1 as transactions, and 0 as OTPs. Out of 96 alerts, 80 are identified as alerts, 5 as ham, 9 as spam, 0 as transactions, and 2 as OTPs. Out of 367 transactions, all are accurately identified as transactions. Out of 327 OTPs, all are accurately identified as OTPs.
Table 5 shows precision and recall values for each category. Transactions and OTPs are high-priority SMS, and the model classifies them perfectly. Ham and spam are personal messages whose precision and recall values are greater than 0.9. Alerts are considered the lowest-priority messages, as they contain advertisements. The precision and recall of alerts are around 0.8.
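The accuracy, precision, and recall figures can be recomputed from the confusion-matrix counts listed above. The sketch below arranges those counts with actual classes as rows and predicted classes as columns; that orientation and the helper names are choices made here, and the recomputed values may differ slightly in rounding from those in Table 5.

```python
LABELS = ["ham", "spam", "alerts", "transactions", "OTPs"]

# Rows are actual classes, columns are predicted classes (counts from the text).
CONFUSION = [
    [338, 5, 4, 0, 0],
    [8, 324, 10, 1, 0],
    [5, 9, 80, 0, 2],
    [0, 0, 0, 367, 0],
    [0, 0, 0, 0, 327],
]


def accuracy(matrix):
    """Fraction of messages on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision_recall(matrix, k):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN) for class k."""
    tp = matrix[k][k]
    predicted = sum(row[k] for row in matrix)   # TP + FP (column sum)
    actual = sum(matrix[k])                     # TP + FN (row sum)
    return tp / predicted, tp / actual


print(f"accuracy = {accuracy(CONFUSION):.3f}")
for k, label in enumerate(LABELS):
    p, r = precision_recall(CONFUSION, k)
    print(f"{label}: precision = {p:.3f}, recall = {r:.3f}")
```

Of the 1,480 test messages, 1,436 lie on the diagonal, which gives the reported accuracy of about 97%.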
Several accuracy comparisons are performed next. Figure 5 compares accuracy against hidden layer sizes for different activation functions (ReLU, logistic, tanh, and identity); the results of each activation function are shown in a different color. The highest accuracy of 97% is achieved by the ReLU activation function when the hidden layer sizes are fixed to {150, 100, 50, 25}.
Hence these values are kept constant for the comparison between epochs and accuracy. Table 6 shows the accuracy trend with respect to the number of epochs, with the activation function and hidden layer sizes held constant. Figure 6 compares the accuracies of the algorithms used in various papers. The multi-class SMS datasets used in existing research are either not provided or were constructed by the researchers themselves and are not publicly available. Hence the comparison is done with the algorithms used in previous works, applied to the existing multi-class SMS dataset. The highest accuracy (97%) was achieved by multi-layer perceptron (MLP) and linear regression (LR). Between MLP and LR, the MLP model is preferred because of the drawbacks of linear regression for classification problems. Linear regression works only on continuous values, whereas classification works on discrete values. Another drawback is that the threshold value must be shifted each time new data points are incorporated into the graph. The next highest accuracy was achieved by SVM at 96.8%. Although the classification accuracy of SVM is high, it is not suitable for large datasets because of its training complexity. The next algorithm compared is NB, which has an accuracy of 96.4%. Regardless of its accuracy, NB can produce poor estimates because it assumes independence between predictors.

Model validation-accuracy of unlabeled data
The unlabeled dataset comprises 100 SMS messages, 20 from each of the ham, spam, alerts, transactions, and OTPs classes. These messages were obtained from students at the Institute of Aeronautical Engineering (IARE) and are not pre-classified into specific SMS categories. Instead, the messages were fed directly to the model for classification, and the results were compared to manual classification. Figure 7 depicts the accuracy of the model with unlabeled data as a confusion matrix, in which 0 indicates the ham class, 1 the spam class, 2 the alerts class, 3 the transactions class, and 4 the OTPs class. The results indicate that out of the 20 ham messages, 17 were correctly classified as ham, while 2 were misclassified as spam and 1 as alerts. Of the 20 spam messages, 15 were accurately classified as spam, 4 were misclassified as ham, and 1 was misclassified as alerts. Out of the 20 alerts, 12 were correctly classified as alerts, 3 were misclassified as ham, 4 were misclassified as spam, and 1 was misclassified as OTP. For the 20 transactions, 16 were classified correctly as transactions, 2 were misclassified as ham, 1 was misclassified as alerts, and 1 was misclassified as OTP. Finally, all 20 OTP messages were correctly classified as OTPs.
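The per-class counts above can be turned into per-class and overall accuracy on the unlabeled data with a few lines of arithmetic. The counts come from the text; the resulting percentages are derived here and are not quoted from the paper.

```python
# Correctly classified messages out of 20 per class, as reported above.
CORRECT = {"ham": 17, "spam": 15, "alerts": 12, "transactions": 16, "OTPs": 20}
PER_CLASS = 20

per_class_recall = {label: n / PER_CLASS for label, n in CORRECT.items()}
overall_accuracy = sum(CORRECT.values()) / (PER_CLASS * len(CORRECT))

for label, value in per_class_recall.items():
    print(f"{label}: {value:.2f}")
print(f"overall: {overall_accuracy:.2f}")
```

On this basis, 80 of the 100 unlabeled messages are classified correctly, with alerts the weakest class and OTPs the strongest.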

CONCLUSION
The SMS auto-response model helps users distinguish between the various messages received on their mobile phones. The model categorizes SMS messages into five main classes (ham, spam, info, transactions, and OTPs) based on the semantics involved in the messages. A neural network model, the multi-layer perceptron, is implemented on the multi-class SMS dataset to address the task of message categorization. The five-class text message classification achieves an accuracy of 97% with MLP. The limitation of the work is that it depends on text messages written in English only. Future work could therefore incorporate similar neural network algorithms to classify SMS written in various languages. This work can also be extended by employing various deep learning techniques for text message classification.


ISSN: 2088-8708 Int J Elec & Comp Eng, Vol.14, No. 1, February 2024: 922-933 924 the NB algorithm for the classification of SMS.A binary classification of SMS into spam and ham messages

Figure 2. Schematic diagram of the proposed auto-response SMS categorization model

Figure 3. Illustration of category-wise semantics visualization in word cloud

Figure 7
depicts the accuracy of the model with unlabeled data illustrated in the confusion matrix.In this, 0 indicates ham class; 1 indicates spam class; 2 indicates alerts class; 3 indicates transactions class; and 4 indicates OTPs class.The results

Figure 7. Illustration of accuracy of SMS auto-response model with unlabeled data

Table 2. Structure features
An intelligent auto-response short message service categorisation model … (Budi Padmaja)

Table 3. Training and testing dataset details

Table 4. Category wise semantic information

Table 5. Classification report

Table 6. Epoch vs accuracy