Sensing complicated meanings from unstructured data: a novel hybrid approach

ABSTRACT


INTRODUCTION
Unstructured text poses difficulties for computer programs due to its lack of a clear structure or defined data model [1]. This makes it unsuitable for conventional database models, causing issues in storage, management, and indexing due to the absence of a schema. Consequently, search results are less accurate because predefined attributes are absent. In today's digital landscape, the volume of diverse unstructured text is increasing, generated from sources like web pages, research papers, and articles [2]. This growth is driven by advancements in web technologies and text extraction tools [3]. While semantic technologies and text mining systems aid in linking text to knowledge, processing complex language and extracting intricate insights remains a challenge [4]. Information extraction (IE) algorithms help extract knowledge by identifying references and relationships between entities [5]. Yet, deeper insights require natural language processing (NLP) techniques, including convolutional neural networks (CNNs).

From the above study, it can be seen that very little work has been done on classifying unstructured data directly, without data cleaning and pre-processing. In addition, comprehensive research is offered in [13], which discusses methods for data extraction from unstructured data. Furthermore, none of the papers [14]-[17] deal with the issue of data extraction from unstructured sources. Both [18] and [19] discuss methods for extracting information from unstructured sources, but neither paper examines these methods on other datasets; the models in [18], [19] may work properly only for their respective datasets. Hence, in this work we utilize NLP and CNN techniques to build a model called AAUT-ML to extract text from unstructured data. This work classifies complex unstructured semantics into their respective domains using the NLP method. In addition, 1-dimensional convolving filters are employed as n-gram detectors, each of which focuses on a certain family of closely related n-grams. The appropriate n-grams are extracted for decision-making through max-pooling over time. Based on the max-pooled data, the rest of the network extracts hidden or complex semantics from the unstructured text.

METHODOLOGY FOR SENSING COMPLICATED MEANINGS FROM UNSTRUCTURED DATA USING NLP AND CNN TECHNIQUES
In this two-phase proposed work, we detect complex semantics from unstructured text using NLP and CNN techniques. In the first phase (NLP), we pre-process, filter, and classify unstructured text into specific data domains. In the second phase (CNN), we examine how a CNN handles text and use that knowledge to discover complex semantics in unstructured text. The main objectives are as follows:
a. To classify unstructured text into respective domains using the NLP method.
b. To employ 1-dimensional convolving filters as n-gram detectors, each of which focuses on a certain family of closely related n-grams.
c. To extract the appropriate n-grams for decision-making through max-pooling over time.
d. To extract hidden or complex semantics from unstructured text with the rest of the network, based on the max-pooled data.
In this experimental setup, we take sample input datasets from computer science domains, namely database, operating system, and data mining unstructured text files in .txt format. Tokenization, stop-word removal, and rare-word removal functions must be applied to the input unstructured documents as part of the pre-processing step. Pre-processing eliminates missing or inconsistent data values caused by human or technical faults, and can make a dataset more accurate, dependable, and consistent. The proposed work is carried out in two distinct phases; the detailed procedures for Phase 1 and Phase 2 are outlined below (refer to Figure 1).
Step 5: In cases where multiple higher-order n-grams encompass a candidate that refers to the most frequently encountered unigram, the remaining candidate n-grams are merged based on two rules: first, plural tokens are filtered out if their singular form is also present, as shown in Figure 1; second, the present participle of a regular verb is not used if an alternative form without it exists.
Step 6: Among the remaining candidates, those without a corresponding entry are filtered out.
Step 7: Initial context information is generated, and the documents are classified by specific domain.
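The Phase 1 pre-processing described above (tokenization, stop-word removal, and rare-word removal) and the generation of candidate n-grams can be sketched in Python. This is a minimal illustration, not the paper's implementation: the stop-word list and the rare-word threshold are assumptions chosen for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; a real system would use
# a fuller lexicon (e.g. from an NLP toolkit).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}

def preprocess(text, rare_threshold=1):
    """Phase 1 pre-processing sketch: tokenization, stop-word removal,
    and rare-word removal."""
    tokens = re.findall(r"[a-z]+", text.lower())           # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]    # stop-word removal
    counts = Counter(tokens)
    # rare-word removal: keep only tokens seen more than rare_threshold times
    return [t for t in tokens if counts[t] > rare_threshold]

def ngrams(tokens, n):
    """Candidate concept n-grams over the filtered token stream."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

The merging and filtering of Steps 5-7 would then operate on the frequency counts of these candidate n-grams.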

Phase 2. Sensing complicated meanings
The following describes how a CNN works for text processing. A three-layer CNN framework is used in this work's implementation. A CNN's fundamental capabilities are analogous to those of the visual cortex in animal brains. CNNs perform admirably in text-classification tasks. Classifying text follows a process similar to categorizing images, except that words are represented by vectors within a matrix rather than pixels.

Target-function
Learning-capable neuron biases and weights are used throughout the target-function implementation. To generate an outcome, neurons receive multiple inputs, compute a weighted sum over those inputs, and send that value through an activation mechanism. By sending the network's output through the softmax layer, a loss function can be determined for the entire system. The fully connected softmax layer also down-samples the network's output.
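As a minimal sketch of this target-function machinery, the following illustrates a single neuron (weighted sum plus bias through an activation, here assumed to be ReLU) and the softmax normalization; the weights and bias values are illustrative, not learned parameters from the paper.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a learnable bias, passed through an
    activation function (ReLU assumed here)."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, z)

def softmax(logits):
    """Normalize raw network outputs into a distribution over classes; the
    loss is typically the negative log of the true class's probability."""
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```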

Representation
The CNN's initial level consists of an embedding level, which transforms word indices into d-dimensional vectors. These vectors are learned through the equivalent of a lookup table. Each word of a sentence is translated through its associated embedding, sentences are padded to the maximum sentence size, and the lookup table covers the vocabulary size. Once all the words have been converted to vectors, they are sent through the convolution layer.
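A minimal sketch of this embedding lookup and padding, with illustrative vector dimensions; in the real model the table entries are learned during training rather than randomly initialized.

```python
import random

def build_embedding_table(vocab, dim):
    """Lookup table mapping each vocabulary word to a d-dimensional vector.
    Randomly initialized here for illustration; training would refine it."""
    rng = random.Random(0)
    return {w: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for w in vocab}

def embed_sentence(words, table, max_len, dim):
    """Translate each word through its embedding and zero-pad the result to
    the maximum sentence length, giving the max_len x dim sentence matrix
    that is fed to the convolution layer."""
    rows = [table[w] for w in words if w in table]
    rows += [[0.0] * dim] * (max_len - len(rows))   # padding
    return rows
```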

System structure
The developed architecture has three distinct stages: the embedding level, which maps words to embedded vectors; the convolution level, which performs most of the approach's work; and the softmax level. The sentence matrix is processed by a set of preset filters, which reduces its dimensions. The softmax level, the final one, acts as a down-sampling level that both reduces the sentence matrix and computes the loss function. The sentence's word embeddings are obtained using the embedded-word lookup table. To ensure that every sentence is handled fairly, the matrix produced by the embedding component is padded. Once the filters have been established, the matrix is reduced and convolved features are generated. These complex characteristics are then simplified: as a next step in down-sampling, the output of the convolved features is passed through the max-pooling level. Filters of various sizes and shapes are specified; three, four, and five are the filter widths employed by the suggested approach. Padding is applied to each embedded sentence so that the resulting sentence matrices possess an identical shape and size. Embedding every token of the n-word input text as a d-dimensional vector yields word vectors w_1, ..., w_n ∈ R^d. The generated n × d matrix can then be convolved by sliding a window across the text within the convolutional level, applying each filter to every l-word n-gram.
where the matrix W ∈ R^(ld×m) collects the m filters. Max-pooling applied along the n-gram dimension yields p ∈ R^m. The rectified linear unit (ReLU) non-linearity is then applied to p. The distribution across the classes used for classification is generated by a fully connected linear layer C ∈ R^(m×k), and the class with the highest strength is output. In execution, we employ a range of window widths l by chaining together the pooled output vectors of the corresponding convolution levels. The procedures described here also work for dilated convolutions. This is represented in (2) to (6).
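The convolution, max-pooling-over-time, and concatenation steps above can be sketched in plain Python. This is an illustrative toy, not the paper's model: the filter weights are supplied by hand rather than learned, and each filter is represented as a list of rows matching its window width.

```python
def conv_scores(sent, filt):
    """Slide one filter over the sentence matrix: every l-word window
    (n-gram) is scored by the inner product of its stacked word vectors
    with the filter weights."""
    l, d = len(filt), len(filt[0])
    return [sum(sent[i + j][k] * filt[j][k]
                for j in range(l) for k in range(d))
            for i in range(len(sent) - l + 1)]

def max_pool_relu(scores):
    """Max-pooling over time followed by the ReLU non-linearity."""
    return max(0.0, max(scores))

def pooled_features(sent, filters):
    """Concatenate the pooled outputs of several filters (possibly of
    different window widths) into the feature vector p that is fed to
    the final fully connected layer."""
    return [max_pool_relu(conv_scores(sent, f)) for f in filters]
```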

Identification of important features
According to conventional wisdom, filters can be thought of as n-gram detectors: every filter looks for a distinct category of n-grams and marks them with high scores. After the max-pooling process, only the n-grams with the most favorable scores remain. Once the n-grams within the max-pooled vector (represented through the collection of matching filters) have been determined, a conclusion can be drawn. Any filter's high-scoring n-grams (compared with the way it ranks similar n-grams) should be considered particularly useful for text classification. In this subsection, we refine this perspective by posing and aiming to answer the following questions: what data about the underlying n-grams is available in the max-pooled vector, and in what manner is it utilized for the final classification?

Informative vs. uninformative 𝒏-grams
The pooled vector p ∈ R^m serves as the foundation for the classification process. Every value of p is derived through the ReLU applied to the maximum inner product between an n-gram's stacked word vectors and the corresponding filter. That value can be attributed to the specific n-gram, a sequence of words [w_i, ..., w_(i+l-1)], which activated the filter. The collection of n-grams that contribute to the pooled vector forms the basis of the decision; n-grams outside this collection cannot have any effect on the classifier's decision. However, it is imperative to consider the n-grams that are present in the collection. In prior investigations into prediction-based analysis of CNNs for text, the focus has been on identifying the n-grams found within the input sequence and evaluating their respective rankings as a method of understanding the underlying prediction process. In this context, we adopt a more nuanced perspective. It is important to highlight the following: the ultimate classification process does not necessarily consider the specific n-gram identities, but rather evaluates them based on the rankings allocated by the filters. Therefore, the information contained in p is contingent upon the allocated rankings. From a conceptual standpoint, the contributing n-grams can be categorized into two distinct classes: accidental and deliberate. The presence of deliberate n-grams can be attributed to the higher rankings assigned by the filtering mechanism, which suggests that these n-grams possess valuable information relevant to the ultimate decision-making process. In contrast, accidental n-grams, despite possessing a relatively low ranking, manage to find their way into the pooled vector only because no other n-gram achieved a higher
ranking compared to them. Based on the analysis conducted, such accidental n-grams do not appear to possess significant informational value for the classification decision at hand. Is it possible to distinguish between intentional and unintentional n-grams? It is postulated that, within the framework, a discernible threshold exists for every filter; values surpassing this threshold are treated as informative, as expressed in (7). Based on our empirical findings, we determine that a purity value of 0.75 is optimal when determining the threshold of a given filter. Further, the results of the proposed work have been evaluated using three datasets in terms of recall, precision, and macro-averaged F1-score. The results are discussed in the next section.

RESULTS AND DISCUSSION
This section commences by providing a comprehensive overview of the system requirements, followed by a detailed examination of the datasets employed in the study. Additionally, the performance metrics utilized to evaluate the system's efficacy are examined. The results obtained from the proposed methodology are then compared to those of previous studies, specifically on the basis of recall, precision, and macro-averaged F1-score. A discussion section provides a thorough analysis and interpretation of the obtained results.

System requirements, datasets and performance metrics
In this section, we conduct a series of experiments on Phase 1 and Phase 2. The code was written in Python and executed on a system with a Windows 10 operating system and 16 GB of RAM. Three different datasets, namely the data mining (DM) [20], operating system (OS) [21], and database (DB) [22], [23] datasets, are used for the experiments. The DM, OS, and DB datasets were created in [24]. The performance of the presented AAUT-ML approach is compared with existing approaches, i.e., YAKE [15], TF-IDF [25], and TextR [26]. All the results and datasets have been taken from [24]. The results of AAUT-ML are analyzed using the recall, precision, and F1-score metrics. Precision measures how many of the instances predicted as positive actually belong to the positive category. Recall, calculated as in (9), is the proportion of correctly predicted positive outcomes compared to actual positive outcomes. The F1-score is a crucial evaluation metric in machine learning: it summarizes a model's predictive ability by combining the precision and recall measures, as in (10).
F1-score = (2 × precision × recall) / (precision + recall)   (10)

where precision is the number of predicted key-phrases that were found to be a good match with the standard key-phrases divided by the total number of predicted key-phrases from the document.
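The three metrics can be sketched over predicted versus gold-standard key-phrase sets (a minimal illustration of the formulas above, not the paper's evaluation code):

```python
def prf(predicted, gold):
    """Precision, recall, and F1 over predicted vs. gold key-phrase sets.
    Precision = matched / predicted; recall = matched / gold;
    F1 = harmonic mean of the two, as in (10)."""
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```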

Precision
In Figure 2, the precision has been evaluated and compared with the YAKE, TF-IDF, and TextR models. For the DM dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 66.21%, 87.66%, and 98.65%, respectively. For the OS dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 66.74%, 84.64%, and 97.57%, respectively. For the DB dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 33.20%, 84.52%, and 97.35%, respectively. The proposed AAUT-ML thus achieved better precision than YAKE, TF-IDF, and TextR.

Recall
In Figure 3, the recall score has been evaluated and compared with the YAKE, TF-IDF, and TextR models. For the DM dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 48.17%, 81.70%, and 96.34%, respectively. For the OS dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 62.91%, 51.65%, and 3.97%, respectively. For the DB dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 28.67%, 80.51%, and 96.69%, respectively. The proposed AAUT-ML thus achieved better recall than YAKE, TF-IDF, and TextR.

In Figure 4, the macro-averaged F1-score has been evaluated and compared with the YAKE, TF-IDF, and TextR models. For the DM dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 64.93%, 72.22%, and 96.52%, respectively. For the OS dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 61.88%, 86.06%, and 95.28%, respectively. For the DB dataset, AAUT-ML performed better than YAKE, TF-IDF, and TextR by 30.79%, 82.88%, and 99.61%, respectively. The proposed AAUT-ML thus achieved a better macro-averaged F1-score than YAKE, TF-IDF, and TextR.

Discussion
Table 1 shows the macro-averaged F1-score, recall, and precision values for the four methods (AAUT-ML, YAKE, TF-IDF, TextR) applied to the three datasets (data mining, operating system, database). For the operating system dataset, AAUT-ML performs best across all metrics, with high F1-score, recall, and precision values. In the data mining dataset, AAUT-ML also performs well relative to the other methods. Finally, for the database dataset, YAKE and TF-IDF have similar F1-score, recall, and precision values, with AAUT-ML performing slightly better. This work also tried to utilize sparsemax and fusedmax, but both failed to achieve better results than softmax.

CONCLUSION
In this work, sensing complex meanings from unstructured data using natural language processing and convolutional neural network techniques has been presented. In this analysis, different datasets, namely the database, data mining, and operating system datasets, are used. Our study has challenged some conventional assumptions about how CNNs process and classify text. First, we have demonstrated that max-pooling over time introduces a thresholding effect on the output of the convolution layer, effectively distinguishing between relevant and non-relevant features for the final classification. This insight allowed us to identify the n-grams crucial for classification, associating each filter with the class it contributes to. We have also highlighted instances where filters assign negative values to specific word activations, leading to low scores for n-grams containing them despite otherwise highly activating words. These findings contribute to enhancing the interpretability of CNNs for text classification. Our approach effectively categorizes various documents into their respective domains. We evaluated its performance using precision, recall, and F1-score across multiple datasets, demonstrating superior results compared to existing methods. For future work, this approach can be applied to classifying corpus semantics in structured data, different feature extraction processes can be used for structuring the data, and machine learning can be used alongside NLP.
Figure 1 provides a comprehensive block diagram of the novel developed architecture.

Figure 1. Novel developed block diagram of NLP and CNN for text classification and detection of complex semantics

Figure 2. Macro-averaged precision scores of AAUT-ML versus three baseline methods on three different datasets
Figure 3. Recall scores of AAUT-ML versus three baseline methods on three different datasets
Figure 4. Macro-averaged F1-scores of AAUT-ML versus three baseline methods on three different datasets

Given a collection of unstructured text files D = {d_1, d_2, d_3, ..., d_n} describing a collection of concepts C = {c_1, c_2, c_3, ..., c_m}, where both n and m are greater than 1, the objective is to identify and categorize the most pertinent concepts from the collection. The initial step involves tokenizing the unstructured textual documents into n-grams that serve as the preliminary candidate concepts. A meaningful unigram (n = 1) is retained only if it occurs at least k times more frequently than any larger n-gram containing the same unigram within it (refer to Figure 1).

Table 1. Comparative study