Identification of monolingual and code-switch information from English-Kannada code-switch data

ABSTRACT

There are various ways to write code-switch text on social networking websites. For example, [E3] shows one way of writing code-switch text, in which both English and Kannada are used in the same sentence.
[E4] shows another way of writing code-switch text, in which the source language and the script differ: the source language is Kannada and the script is English. Several researchers have recently experimented with language identification (identification of monolingual and code-switch information). Research on code-switching started in 1970, and a few hypotheses have been proposed to explain the motivations behind code-switching. Examples include the markedness model [3], diglossia [7], communication accommodation theory [8] and the conversational analysis model [9].
Ahmed et al. [10] proposed an efficient method for language identification using cumulative frequency addition (CFA) of N-grams. However, more testing on large datasets is required to evaluate the performance of CFA. Rosner and Farrugia [11] discussed language identification in English-Maltese code-mixed data and achieved nearly 95% accuracy. However, numeric and punctuation entities were completely ignored in this approach; including them would help to build a more accurate model. Solorio and Liu [12] predicted possible code-switch points in Spanish-English code-switch data. They trained various learning algorithms on transcriptions of code-switched speech. The average scores for the code-switch sentences produced by the machine learning approach were close to those for sentences produced by humans. The accuracy could be improved further by including a multi-word expression recognition system. Piergallini et al. [13] performed word-level language identification and code-switch point prediction on Swahili-English code-switch data. The proposed approach achieved high accuracy in language identification and a moderate improvement in code-switch point prediction. Further work on this approach needs to focus on the social analysis of code-switch behaviour, such as the association between linguistic accommodation and power, or between code-switching and social solidarity.
Yirmibeşoğlu and Eryiğit [14] proposed a system to identify code-switching between Turkish and English using character level n-grams and conditional random fields (CRFs), and achieved a micro-average F1-score of 95.6%. However, there is still scope for improving the F1-score by increasing the corpus size. Barman et al. [15] presented a preliminary study on automatic language identification in code-mixed social media messages involving Indian languages. They performed word-level classification on Bengali-Hindi-English code-mixed data and concluded that character n-gram features are useful for language identification in code-mixed data. Word-level code-mixing was completely ignored in this work. Veena et al. [16] developed a system for word-level language identification in Tamil-English and Malayalam-English code-mixed data. This approach is based on character-based embeddings with context information and achieved 93% accuracy on Malayalam-English and 95% accuracy on Tamil-English code-mixed data. More features could be used to improve the system performance further.
Mave et al. [17] examined different code-switching metrics and found that the CRF model outperformed the deep learning model by a margin of 2-5% for Spanish-English and 3-5% for Hindi-English. This work can be extended to compare code-switching behaviour across domains such as chat conversations, song lyrics, and movie scripts, and across various language pairs. Gundapu and Mamidi [18] presented a study of different models for language identification in English-Telugu code-mixed data. The CRF model performed best for word-level language identification, with an F1-score of 0.91. Mandal and Singh [19] tested multichannel neural networks on two code-mixed language pairs, Bengali-English and Hindi-English. They attained 93.28% accuracy on Bengali-English code-mixed data and 93.32% accuracy on Hindi-English code-mixed data.
Singh et al. [20] built an automatic named entity recognition (NER) system for Hindi-English code-mixed data, and the proposed system outperformed the existing baseline systems, with an F1-score of 33.18%. However, this work can be extended to build natural language processing (NLP) models that make use of NER for code-mixed data, such as entity-specific sentiment analysers or semantic role labelling. Das et al. [21] presented a supervised learning model for word-level language identification in Bengali-English code-mixed data. Two types of word encoding, character and phonetic, are used along with stacking and threshold techniques. The stacking method achieved 91.78% accuracy and the threshold method achieved 92.35%. Chaitanya et al. [22] tried to identify different languages in code-mixed Facebook comments and compared two word embedding methods, continuous bag of words (CBOW) and skip-gram. Shekhar et al. [23] applied two kinds of models, statistical and neural-based learning models, to Hindi-English code-mixed data. The outcome of the proposed model illustrates that word embeddings are capable of capturing the separation between languages by identifying the source of each word and mapping it to its language label.

ISSN: 2088-8708. Int J Elec & Comp Eng, Vol. 13, No. 5, October 2023: 5632-5640
Very few researchers have focused on language identification in English-Kannada code-switch (EKCS) data. Lakshmi and Shambhavi [24] addressed the problem of word-level language identification for English-Kannada code-mixed data. They evaluated various supervised classifiers and found that the dictionary-based model is better at handling word-level code-mixing in English-Kannada data. However, the problem of identifying monolingual and code-switch information in EKCS data still needs to be addressed. James et al. [25] provided experimental evidence to demonstrate that the accuracy of cloud-based multilingual systems (Google and Microsoft Azure) is low when identifying the Maori language. Their study shows that a BiLSTM with bilingual embeddings identifies Maori-English code-switching points with an accuracy of 87%. Hybrid models that combine deep learning techniques with hand-crafted rules based on the phonotactic differences between the languages could improve the performance of that approach.

METHOD

2.1. Pre-processing and annotation
To implement the monolingual and code-switch information identification task, 10,396 EKCS comments were collected from YouTube.com; these comments are written in English script. The collected dataset contains two types of comments: comments that combine monolingual words (i.e., English words written in English script) with code-switch words (i.e., Kannada words written in English script), and comments that consist entirely of code-switch words. Pre-processing is a fundamental and important task in NLP, since the performance of any model depends on the quality of the data. To generate quality data, noisy data such as unwanted symbols, special characters, and digits must be removed from the dataset, since this noise plays no significant role in tasks like language identification and part of speech (POS) tagging. Hence, we pre-processed the EKCS dataset to remove noisy data. First, digits, special characters, and emojis were removed from the dataset, and then the text was converted to lower case for uniformity. After pre-processing, tokenization was carried out to split each comment into tokens. Finally, the EKCS corpus contains 123,249 tokens from 10,396 comments.
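The cleaning and tokenization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `preprocess` is hypothetical, the example comment is invented, and the sketch assumes that, because the comments are in English script, every character other than `a-z` or whitespace can be treated as noise.

```python
import re


def preprocess(comment: str) -> list[str]:
    """Lower-case a comment, strip noise characters, and split it into tokens."""
    lowered = comment.lower()
    # Assumption: anything that is not a Latin letter or whitespace
    # (digits, punctuation, special characters, emojis) is noise.
    cleaned = re.sub(r"[^a-z\s]", " ", lowered)
    # Whitespace tokenization; empty strings are dropped automatically.
    return cleaned.split()


# Invented example comment mixing English and romanized Kannada words.
tokens = preprocess("Super song guru!!! 100% 🔥 Tumba chennagide")
print(tokens)
```

Replacing noise with a space rather than deleting it keeps words that were separated only by punctuation from being merged into a single token.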
Once the tokens are generated, the next step is annotation, i.e., assigning each token to its relevant class. Annotation is required to train and test supervised machine learning techniques. The entire corpus is annotated with six classes: monolingual (MN), code-switch (CS), names (NE), mix of monolingual and code-switch (MIX), location (LC), and all remaining tokens as unknown (UN). Table 1 shows sample data from the EKCS corpus. Table 2 shows the statistics (number of tokens in each class) of the EKCS corpus. The aim of this work is to identify monolingual and code-switch information, so the corpus has fewer tokens in the NE, LC and UN classes than in the code-switch and monolingual classes. The MIX class has the fewest tokens (448), since a mix of monolingual and code-switch is a rare occurrence in social media text.

Proposed methodology
This section discusses the proposed approach, character level n-gram, for identifying monolingual and code-switch information in the EKCS corpus. Four supervised learning approaches, naïve Bayes (NB), support vector classifier (SVC), logistic regression (LR), and neural networks (NN), are implemented with two types of feature extraction: word-level term frequency-inverse document frequency (TF-IDF) and character level n-gram. To implement the proposed method, the EKCS corpus needs to be divided into a train dataset and a test dataset. An 80:20 split is performed on the EKCS corpus using sklearn.model_selection.train_test_split. Here, 80% of the data is used to train the various models and 20% of the data is used to test the performance of the models. Figure 1 shows the proposed approach for language identification. In total, 24,650 tokens are used as input to test the performance of the proposed model, and the input dataset is pre-processed. Table 3 shows the statistics of the input dataset.

In the next step, feature extraction from the input dataset is carried out. The process of transforming text data into features is called feature extraction. TF-IDF is a technique which provides numerical values for text data and also gives the significance of a specific word in the corpus. Calculating the term frequency is the first step in the TF-IDF process, and this can be done using (1).

$$\mathrm{TF} = \frac{W}{N} \tag{1}$$

where W is the frequency of a word in the document, and N is the total number of words in the document. Once the term frequency is calculated, the next step is to calculate the inverse document frequency, and this can be done using (2).

$$\mathrm{IDF} = \log\left(\frac{D}{\mathit{WD}}\right) \tag{2}$$

where D is the total number of documents, and WD is the number of documents containing the word. Finally, the TF-IDF value can be calculated using (3).

$$\text{TF-IDF} = \mathrm{TF} \times \mathrm{IDF} \tag{3}$$
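The three TF-IDF steps can be traced on a toy corpus. The sketch below assumes the usual forms TF = W/N, IDF = log(D/WD) with a natural logarithm, and TF-IDF = TF × IDF; the function names and the three example documents are invented for illustration and are not taken from the EKCS corpus.

```python
import math


def term_frequency(word: str, document: list[str]) -> float:
    """(1): W / N, occurrences of the word over the document length."""
    return document.count(word) / len(document)


def inverse_document_frequency(word: str, documents: list[list[str]]) -> float:
    """(2): log(D / WD). Assumes the word occurs in at least one document."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / containing)


def tf_idf(word: str, document: list[str], documents: list[list[str]]) -> float:
    """(3): the product of term frequency and inverse document frequency."""
    return term_frequency(word, document) * inverse_document_frequency(word, documents)


# Invented toy corpus: each document is a list of tokens.
docs = [
    ["super", "song"],
    ["song", "tumba", "chennagide"],
    ["tumba", "super", "video"],
]

# "song": W = 1, N = 2, D = 3, WD = 2, so TF-IDF = 0.5 * log(1.5).
print(tf_idf("song", docs[0], docs))
```

A word that appears in every document gets IDF = log(1) = 0 under these definitions, so it carries no weight regardless of how often it occurs.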
A character level n-gram is a sequence of n consecutive characters in a word, where n is the number of characters in the sequence. Algorithm 1 shows the process of generating character level n-grams for each token. Algorithm 2 shows the process of feature extraction using the char_ngrams( ) and byte-pair encoding (bpemb_en.encode()) functions.

Algorithm 1. Character level n-gram generation
Input: word, n /* word: input text; n: number of characters in the sequence */
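A minimal Python sketch of character level n-gram generation with the same inputs (a word and n) might look as follows. It is not a reproduction of the algorithm steps, which are not available in this text: the name `char_ngrams` is borrowed from the function mentioned earlier, but its signature and its sliding-window behaviour are assumptions.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in word, left to right."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    # A word shorter than n yields an empty list because the range is empty.
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# Invented example token (a romanized Kannada word).
print(char_ngrams("tumba", 3))
```

Sub-word sequences of this kind let a classifier exploit spelling patterns even for tokens it has never seen as whole words, which word-level TF-IDF cannot do.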

RESULTS AND DISCUSSION
Four supervised classification methods, NB, SVC, LR and NN, are compared using two feature extraction techniques, word level TF-IDF and character level n-gram. The comparison is based on two measures, accuracy and F1-score. It is observed that the proposed character level n-gram method is more efficient in terms of accuracy and F1-score than word level TF-IDF.

Table 4 shows the comparison of word level TF-IDF F1-scores for the various classifiers. All four classifiers produce almost the same F1-score for the code-switch class, with a variation of about 1%. LR produces a slightly lower F1-score (93%) than the other classifiers for the monolingual class. Surprisingly, the NB classifier produces 0% for the mixed class, as shown in Table 4, since this class has far fewer tokens (about 96, as shown in Table 3) than the other classes. For the rest of the classes, NB also produces a lower F1-score than the other classifiers, as per Table 4, since NB works well for large datasets.

Table 5 shows the comparison of character level n-gram F1-scores for the various classifiers. All four classifiers produce almost the same F1-score for the code-switch class. NB produces a slightly lower F1-score (96%) than the other classifiers for the monolingual class, as shown in Table 5. For the mixed class, NB and LR performed better than with word level TF-IDF, at 47% and 42% respectively. Overall, character level n-gram improves the F1-score for all the classes in comparison with word level TF-IDF.

Table 6 shows the F1-score comparison between word level TF-IDF and character level n-gram for each classifier. SVC and NN performed best with character level n-gram feature extraction (98%) in comparison with word level TF-IDF, followed by the LR classifier (97.9%) and NB (96.1%). Figure 2 shows the improvement in F1-score for the various supervised classifiers.
From Figure 2, it is evident that the LR classifier improves the most with character level n-gram (an improvement of 3.8%), followed by SVC and NN with 1.7% and NB with 1.6%.

Figure 2. F1-Score improvement

Table 7 shows the accuracy comparison of word level TF-IDF and character level n-gram for each classifier. SVC and NN performed best with character level n-gram feature extraction (97.9%) in comparison with word level TF-IDF, followed by the LR classifier (97.8%) and NB (96%). Figure 3 shows the improvement in accuracy for the various supervised classifiers. It is evident that the LR classifier improves the most with character level n-gram (an improvement of 4.1%), followed by NB (2.2%) and by SVC and NN (1.8% each).

Figure 3. Accuracy improvement
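For reference, the two evaluation measures used above can be computed from gold and predicted labels as in the following sketch. It uses the standard definitions (accuracy as the fraction of correct predictions, and per-class F1 as the harmonic mean of precision and recall); the function names and the six-token example are invented, and the sketch makes no claim about how the per-class scores were averaged in the reported tables.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of tokens whose predicted label equals the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true: list[str], y_pred: list[str], label: str) -> float:
    """Per-class F1: harmonic mean of precision and recall for one label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        # No correct prediction for this class: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented gold and predicted labels for six tokens.
gold = ["CS", "CS", "MN", "MIX", "MN", "CS"]
pred = ["CS", "MN", "MN", "CS", "MN", "CS"]

print(accuracy(gold, pred))
print(f1_for_class(gold, pred, "CS"))
print(f1_for_class(gold, pred, "MIX"))
```

In this toy example the single MIX token is misclassified, so the MIX F1-score is 0 even though overall accuracy stays fairly high. This is the same pattern as a rare class scoring 0% while the frequent classes score well.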

CONCLUSION
In this work, we addressed the identification of monolingual and code-switch information from English-Kannada code-switch data and conducted experiments with various supervised classifiers using two feature extraction techniques, word level TF-IDF and character level n-gram. From the proposed

BIOGRAPHIES OF AUTHORS
Ramesh Chundi received the B.Sc. degree in computer science and the MCA degree from Sri Venkateswara University, India, in 2004 and 2007, respectively. He is currently pursuing a Ph.D. degree in Computer Science and Applications at REVA University, India. His research interests include natural language processing (NLP), artificial intelligence (AI), machine learning, deep learning, data analytics, and data mining. He can be contacted at email: chundiramesh@gmail.com.

Vishwanath R. Hulipalled
is a Professor in the School of Computing and IT, REVA University, Bangalore, Karnataka, India. He completed his BE, ME and Ph.D. in Computer Science and Engineering. His areas of interest include machine learning, natural language processing, data analytics and time series mining. He has more than 24 years of academic and research experience. He has authored more than 50 research articles in reputed journals and conference proceedings. He can be contacted at email: vishwanth.rh@reva.edu.in.

Jay Bharthish Simha
is the CTO of ABIBA Systems and Chief Mentor at RACE Labs, REVA University. He completed his BE (Mech), M.Tech (Mech), M.Phil.(CS) and Ph.D.(AI). His area of interest includes fuzzy logic, soft computing, machine learning, deep learning, and applications. He has more than 20 years of industrial experience and 4 years of academic experience. He has authored/co-authored more than 50 journal/conference publications. He can be contacted at email: jay.b.simha@reva.edu.in and jay.b.simha@abibasystems.com.