Improving misspelled word solving for human trafficking detection in online advertising data

ABSTRACT


INTRODUCTION
Human trafficking is a global problem that affects victims all around the world [1]. Women, men, and children can all fall victim to this crime, but women and children are particularly vulnerable because they are generally physically weaker than men. In the past, pimps used face-to-face communication to reach their customers, but this method made it easier for law enforcement authorities to track them down. Nowadays, pimps have shifted to the internet to conduct their business because it is more convenient and secure [2]. Online advertisements are the primary means through which pimps and customers connect, which makes tracking them a challenging task for law enforcement authorities. Therefore, using a machine learning (ML) model to detect human trafficking activities in online advertisements is key to stopping this vicious cycle. Such models have gained attention because they can efficiently handle large amounts of data and automate the screening process, ultimately reducing workload and time.
Machine learning [3] is a branch of artificial intelligence that focuses on teaching machines through data to build models for tasks such as text classification, face detection, and face recognition. Datasets can be collected in various ways, including application programming interfaces (APIs) [4], organizations, and crowdsourcing [5]. However, the quality of the data, including problems such as misspelled words [6], strongly affects model performance. Several tools are available to address this problem, including Pyspellchecker [7], autocorrect [8], and TextBlob [9]. Unfortunately, these tools do not consider words and their contexts in the domain of interest, which may cause prediction errors and indicates a need to improve data preparation. This paper presents an approach to solving the misspelled word problem in human trafficking datasets. The approach detects and analyzes misspelled words and their contexts, then replaces them with suitable alternatives. For our experiments, we used the Trafficking-10k dataset [10] provided by Marinus Analytics, which contains many misspelled words because it was collected from online advertisements. The experimental results show that the proposed approach can improve the accuracy of models in detecting human trafficking activities in advertisements. The rest of the paper is organized as follows: related works are discussed in section 2, section 3 briefly describes the datasets used in the experiment, section 4 presents the proposed approach, section 5 describes the experiment and its results, section 6 discusses the experimental results, and finally, section 7 presents the conclusion and future work.

RELATED WORKS
Models for detecting human trafficking activities have garnered attention as a way to strengthen the detection of illegal online activities. McAlister [11] proposed an approach to collect data related to human trafficking in Romania. This method involved scraping adult services advertisements in English, which were then translated into Romanian and used to search for similar advertisements in Romania. Roshan et al. [5] developed a simple tool for collecting data from a mobile application to detect human trafficking. The mobile application collected text and images while concealing the identities of senders. Suspected information about human trafficking activities was then reported to the authorities for further investigation. However, this work was found to have repeated reports of alteration, which raised questions about the reliability of the data and the limitations of data cleansing, indicating insufficient potential for practical use. Mensikova and Mattmann [12] proposed an approach that applied sentiment analysis to create a model for identifying advertisements potentially connected to human trafficking activities. However, the reliability of the dataset and the possibility of age-based biases raised questions about prediction errors. Hernandez-Alvarez [13] demonstrated an approach for detecting human trafficking activities on Twitter. This method involved collecting data from Twitter using an API, then cleansing the data and selecting relevant features before creating a model. The model showed potential for detecting human trafficking activities, but concerns were raised about the completeness of machine-based feature selection, indicating a need for further improvement of the approach.
Additional processing in data preparation is gaining interest as a way to complete feature selection and improve model performance. Diaz and Panangadan [14] introduced an approach for preparing additional datasets for specific human trafficking cases. The dataset was first reviewed online by domain experts in sexual trafficking businesses. This certified dataset was a combination of data from Rubmaps and Yelp, obtained using natural language processing (NLP). Rubmaps is a business review website that identifies businesses engaged in illegal activities, whereas Yelp is an online review forum that provides location data. The approach combined the datasets based on location data. As a result, the approach was able to identify illegal sexual trafficking activities in specific locations. Zhu et al. [15] proposed the use of a language model to identify features for model creation. The model showed the capability to detect advertisements related to human trafficking activities.
Most of the reported approaches were developed based on text classification and omitted emoji features [11]-[15]. This could reduce the reliability of model predictions when they are applied to real human trafficking activities online. As emojis are another factor that expresses intention directly [16], Whitney et al. [17] presented an approach using knowledge management and NLP to explore emojis related to human trafficking activities. Unfortunately, the method was not applied further to create a detection model. Additionally, the datasets used in most reports were not labeled by domain experts [11], [13], [15], which means that the results from the reported models could be questioned.
To improve the justification of predictions obtained from the model, Wiriyakun and Kurutach [18] demonstrated the use of local interpretable model-agnostic explanations (LIME) [19] to filter words related to human trafficking as an alternative approach to selecting features for model creation. The model showed the potential to detect advertisements relevant to human trafficking activities. However, the method only considered text classification, indicating a need to extend the model with emoji feature selection to improve the accuracy of its predictions.
To the best of our knowledge, there is no ML model for alerting on human trafficking activities that improves prediction accuracy by correcting missing information such as misspelled words and emojis. This paper presents a novel approach for data preparation of suspected datasets related to human trafficking activities. The dataset was prepared in several steps that cover all aspects related to feature selection, such as text, emojis, and misspelled words, to achieve reliable model predictions for detecting human trafficking activities. The proposed approach was then applied to the labeled dataset, Trafficking-10k [10], to demonstrate the performance of model prediction on human trafficking activities, using the percentage of accuracy as an indicator.

DATASET
The four datasets used in this work were Trafficking-10k [10], emoji [17], dig-dictionaries [20], and an English word list [21]. Trafficking-10k is a dataset that contains online adult service advertisements from backpage.com and is provided by Marinus Analytics. Marinus Analytics is a company that aims to develop technology to disrupt human trafficking, child abuse, and cyber fraud. The dataset was labeled by domain experts to identify the potential for human trafficking activities in the advertisements. The level definitions are shown in Table 1.
The emojis used in this work were identified as those that pimps use in place of words to avoid detection by law enforcement authorities. They use emojis to convey meanings such as price, young girls, and women's virginity. The emoji definitions are shown in Table 2.
A dictionary of categorized words related to adult services from the University of Southern California [20] was used as a reference vocabulary database. Dig-dictionaries are dictionaries that collect vocabulary from the internet and then classify the words into categories. In this work, we used five categories of dig-dictionaries, namely cup size, escort service, place, nationality, and person name, to classify the dataset. Examples are shown in Table 3.
The English word list [21] is a list of English words. This work compared the words in the Trafficking-10k dataset against it to find misspelled words. Each misspelled word is then replaced with another English word that has the same meaning.

METHOD
This work develops an approach for data preparation that can resolve spelling mistakes in text classification and recover missing information from advertisements suspected of human trafficking activities. The Trafficking-10k dataset was processed using the proposed approach before being used to train conventional models, to demonstrate an improvement in the models' accuracy. The proposed approach consists of nine main steps, as shown in Figure 1. Each step is explained in subsections 4.1 to 4.9.
Figure 1. An overview of the proposed approach for data preparation

Tag advertisements containing emoji indicators
In this first task, advertisements that contained emojis used as indicators by pimps were tagged. The tags supported the model's detection because they referenced important data that pimps wanted to hide from detection, such as money, women, and girls. The indicators are shown in Table 2, while Table 4 shows an example of the tag indicators.
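The tagging task described above can be sketched as follows. This is a minimal illustration only: the emoji-to-tag mapping and the function name `tag_emoji_indicators` are hypothetical placeholders, since the actual indicator list is the one given in Table 2.

```python
# Hypothetical indicator mapping (placeholder entries, not the paper's Table 2).
EMOJI_INDICATORS = {
    "\U0001F4B0": "money",        # U+1F4B0 MONEY BAG
    "\U0001F352": "indicator_2",  # U+1F352 CHERRIES, placeholder tag name
}


def tag_emoji_indicators(ad_text):
    """Return the sorted tags of all indicator emojis found in the advertisement."""
    return sorted(
        {tag for emoji, tag in EMOJI_INDICATORS.items() if emoji in ad_text}
    )
```

An advertisement without any indicator emoji simply receives an empty tag list, so the tags can be stored as an extra feature column next to the text.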

Tag words found in dig-dictionaries
This work used words in the escort service, cup size, place, nationality, and person name categories from dig-dictionaries because these categories are often associated with human trafficking activity. The task involved tagging advertisements containing words from these categories. This task is shown in Algorithm 1.
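As a rough illustration of dictionary-based tagging (not a reproduction of Algorithm 1), the sketch below looks up each token of an advertisement in per-category word sets. The miniature dictionaries and the name `tag_dictionary_words` are assumptions made for this example; the real dig-dictionaries are far larger.

```python
import re

# Hypothetical miniature dictionaries standing in for dig-dictionaries.
DIG_DICTIONARIES = {
    "place": {"hotel", "spa"},
    "escort_service": {"incall", "outcall"},
    "person_name": {"amy"},
}


def tag_dictionary_words(ad_text):
    """Map each matching category to the sorted dictionary words found in the ad."""
    tokens = set(re.findall(r"[a-z]+", ad_text.lower()))
    return {
        category: sorted(tokens & words)
        for category, words in DIG_DICTIONARIES.items()
        if tokens & words
    }
```

Using set intersection keeps the lookup independent of word order and of how often a word is repeated in the advertisement.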

Emoji decoding
Emojis are used in advertisements because they can convey mood, and pimps use emojis in their advertisements to promote their business. Emojis can often convey more meaning than words. However, traditional text classification models often ignore emojis and focus only on words, which leaves an escape route for human trafficking advertisements to avoid being spotted. To address this issue, this task involved decoding emojis into words based on the definitions provided by the Unicode Consortium [22]. Table 5
ISSN: 2088-8708 Int J Elec & Comp Eng, Vol. 13, No. 6, December 2023: 6558-6567
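One way to approximate the emoji decoding step is to use the Unicode character names shipped with Python's standard `unicodedata` module. This is a sketch under that assumption; the paper does not state which tool it used, and the function name `decode_emojis` is hypothetical. Multi-codepoint emoji sequences (for example, skin-tone or joined emojis) are not handled here.

```python
import unicodedata


def decode_emojis(text):
    """Replace symbol characters with their lowercase Unicode names."""
    pieces = []
    for ch in text:
        # Category "So" (Symbol, other) covers most pictographic emojis,
        # and also some non-emoji symbols such as the copyright sign.
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            pieces.append(f" {name.lower()} " if name else " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around decoded names.
    return " ".join("".join(pieces).split())
```

Padding each decoded name with spaces keeps it from being glued to neighbouring words, so a downstream tokenizer sees it as separate tokens.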

Data extraction
The pimps' advertisements contained a lot of data, such as links and images, to attract customers. This data can therefore be used as features for creating a model. To extract data from the advertisements' content, this task used regular expressions [23].
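The extraction step could look roughly like the sketch below, which pulls links and counts image tags with regular expressions. The two patterns and the feature names are assumptions for illustration; the paper does not list the expressions it used.

```python
import re

# Illustrative patterns (assumed, not taken from the paper).
URL_RE = re.compile(r"https?://[^\s<>\"']+")
IMG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def extract_features(ad_html):
    """Extract the links and the number of image tags from raw ad content."""
    return {
        "links": URL_RE.findall(ad_html),
        "n_images": len(IMG_RE.findall(ad_html)),
    }
```

The extracted values can be kept as separate features before the HTML tags are stripped in the cleaning step.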

Manage unnecessary data
The dataset generally contained a lot of unnecessary data, which wasted processing time. We used well-known techniques to manage the data and solve this problem: i) converting text to lowercase, ii) adding a space between emojis and HTML tags, iii) removing HTML tags, iv) expanding abbreviations to their full forms, and v) replacing all numbers in the text content with "[[number]]".
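The five cleaning operations listed above can be sketched as a single function. The abbreviation table and the name `clean_ad` are hypothetical, and the regular expressions are one plausible choice rather than the authors' implementation.

```python
import re

# Hypothetical abbreviation table; the paper does not list its entries.
ABBREVIATIONS = {"u": "you", "ur": "your"}


def clean_ad(text):
    """Apply the five cleaning steps in order to one advertisement."""
    text = text.lower()                                  # i) lowercase
    text = re.sub(r"(<[^>]+>)", r" \1 ", text)           # ii) pad HTML tags with spaces
    text = re.sub(r"<[^>]+>", " ", text)                 # iii) remove HTML tags
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    text = " ".join(tokens)                              # iv) expand abbreviations
    return re.sub(r"\d+(?:[.,]\d+)*", "[[number]]", text)  # v) mask numbers
```

For example, `clean_ad("Call ME<br>u pay 100")` yields `call me you pay [[number]]`; padding the tags first prevents the words on either side of `<br>` from being merged when the tag is removed.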

Calculate words in advertisement
This step counted the occurrences of English words [21] in Trafficking-10k and stored them in a dictionary. The keys were English words, and each value was a list of occurrence counts with three indexes: index 0 held the total occurrences in class 0, index 1 held the total occurrences in class 1, and index 2 held the total occurrences in all classes. Table 6 shows some examples of keys and values. The task is shown in Algorithm 2.
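The dictionary structure described above can be illustrated with a short sketch (not a reproduction of Algorithm 2). The function name `count_words`, the tokenization by lowercase letters, and the input format of `(text, label)` pairs are assumptions.

```python
import re


def count_words(ads, english_words):
    """Count English words per class.

    ads: iterable of (text, label) pairs with label 0 or 1.
    Returns {word: [count_class_0, count_class_1, count_all]}.
    """
    counts = {}
    for text, label in ads:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in english_words:
                entry = counts.setdefault(token, [0, 0, 0])
                entry[label] += 1  # index 0 or 1: per-class total
                entry[2] += 1      # index 2: total over all classes
    return counts
```

Tokens that are not in the English word list never become keys, which is what lets the next step treat any word missing from this dictionary as misspelled.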

Mask misspelled word
In this task, words that were not in the dictionary from the previous step were detected as misspelled and replaced with "[MASK]". The task is shown in Algorithm 3. These mask tags were then resolved in the next step.
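A minimal sketch of the masking step follows; it is not a reproduction of Algorithm 3. It assumes the text has already been cleaned and is whitespace-tokenized, and it keeps placeholder tokens such as "[[number]]" unmasked, which is a design assumption of this example.

```python
def mask_misspelled(text, vocabulary):
    """Replace every token that is not in the vocabulary with "[MASK]"."""
    masked = []
    for token in text.split():
        # Keep known words and placeholder tokens such as "[[number]]".
        if token in vocabulary or token.startswith("[["):
            masked.append(token)
        else:
            masked.append("[MASK]")
    return " ".join(masked)
```

With the vocabulary `{"our", "hotel", "has", "woman"}`, the input `our hotel has beutiful woman` becomes `our hotel has [MASK] woman`.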

Predict mask word
This step uses bidirectional encoder representations from transformers (BERT) [24] to predict words that replace the masked words from the previous task. BERT is a widely used model that is pre-trained on unlabeled data and is not specific to any particular domain. As a result, the predicted words were sometimes uncommon in the domain of interest. For example, BERT may replace '[MASK]' with 'beautiful' in the sentence "Our hotel has [MASK] woman". The advertisement would then read "Our hotel has beautiful woman", which does not reflect the human trafficking purpose. Commonly, pimps prefer the word 'sexy' to 'beautiful' in their advertisements because it is more appealing to their customers. To address this issue, this task combined the probability of words in Trafficking-10k with the BERT prediction. This was calculated using (1),
where P(wi) is the probability of word index i from the pre-trained model and (   ) is the probability of word index i in the advertisement class.
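The idea of re-ranking BERT candidates with domain statistics can be illustrated as follows. Equation (1) is not reproduced in this text, so the combination rule used here, a simple product of the two probabilities, is an assumption and may differ from the authors' formula. The name `rerank_candidates` is hypothetical, `bert_probs` stands in for candidate probabilities produced by a masked language model, and `word_counts` uses the `[class 0, class 1, all]` layout from the counting step.

```python
def rerank_candidates(bert_probs, word_counts, ad_class):
    """Pick the candidate with the highest combined score.

    bert_probs: {word: probability from the pre-trained model}, non-empty.
    word_counts: {word: [count_class_0, count_class_1, count_all]}.
    ad_class: 0 or 1, the class of the advertisement being repaired.
    """
    class_total = sum(counts[ad_class] for counts in word_counts.values())
    scores = {}
    for word, p_bert in bert_probs.items():
        count = word_counts.get(word, [0, 0, 0])[ad_class]
        p_domain = count / class_total if class_total else 0.0
        # Assumed combination rule: product of the two probabilities.
        scores[word] = p_bert * p_domain
    return max(scores, key=scores.get)
```

In the toy case where BERT gives 'beautiful' 0.6 and 'sexy' 0.3, but 'sexy' accounts for 9 of 10 class-1 word occurrences, the product favours 'sexy' for a class-1 advertisement (0.27 versus 0.06), which mirrors the example in the text.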

Remove stop word
The advertisements also contained words that did not affect the model's decision. Therefore, this task removed stop words from the advertisements to reduce training time and improve the model's performance. Table 7 shows some example results of this task.
The proposed approach was used to process data suspected of human trafficking activities, recovering missing information from misspelled words and emojis and removing unnecessary data. The data processed by this approach was then used to create models to demonstrate its predictive performance for human trafficking alerts.
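The stop word removal step, the last of the nine, can be sketched in a few lines. The stop word list below is a small hypothetical sample; the paper does not state which list it used.

```python
# Small hypothetical stop word list for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "has", "our", "to", "of", "and"}


def remove_stop_words(text):
    """Drop stop words from a cleaned, whitespace-tokenized advertisement."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)
```

Applied to `our hotel has sexy woman`, this returns `hotel sexy woman`.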

EXPERIMENT
The experiment used the Trafficking-10k dataset to demonstrate the efficiency of the proposed approach for data preparation in model creation. The results were then compared to those obtained from models used in other works [25]-[27]. The Trafficking-10k dataset, which was labeled with the seven levels shown in Table 1, was relabeled into two classes: class 0 representing levels 0-2 (not related to trafficking) and class 1 representing levels 4-6 (related to trafficking). The advertisements marked as level 3 were excluded from the experiment because they were considered noise in classification. The experiment used a total of 7,698 advertisements, divided equally between the two classes. To create the models, the same learning parameters and dataset were used. Accuracy was employed to measure and compare the performances, calculated by (2). The results are shown in Table 8. The confusion matrices of k-nearest neighbors (KNN), naive Bayes (NB), and multilayer perceptron (MLP) using the proposed approach are shown in Figures 2 to 4, respectively.
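The relabeling rule and the accuracy measure can be written down as a short sketch. Equation (2) is not reproduced in this text, so the code assumes the usual definition of accuracy, the number of correct predictions divided by the total number of predictions. The function names are hypothetical.

```python
def relabel(level):
    """Map a Trafficking-10k level (0-6) to a binary class, or None for level 3."""
    if 0 <= level <= 2:
        return 0  # not related to trafficking
    if 4 <= level <= 6:
        return 1  # related to trafficking
    return None   # level 3 is excluded as noise


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (assumed form of (2))."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)
```

Advertisements for which `relabel` returns `None` would be dropped before training, which matches the exclusion of level 3 described above.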

RESULTS AND DISCUSSION
Based on the results shown in Table 8, the models using the proposed approach for data preparation achieved accuracy rates of 62.34%, 69.35%, and 75.32% when using KNN, NB, and MLP, respectively. In contrast, the accuracies obtained from the other models were in the range of 50.71% to 50.97% (for KNN), 52.40% to 52.73% (for NB), and 51.88% to 52.53% (for MLP). The results indicate that the models using the proposed data preparation achieved higher accuracy rates, which is evidence of the importance of this step for improving model performance. Notably, correcting misspelled words should be considered alongside text classification to ensure full coverage of data filtration.
The proposed approach for data preparation has shown potential for use in real applications for human trafficking detection. Its applicability will be demonstrated by applying it to different datasets for human trafficking detection. Furthermore, this approach could be applied to various domains of interest for data preparation purposes.

CONCLUSION
This work presents a method for resolving spelling mistakes and recovering missing information to improve the accuracy of model predictions for human trafficking detection. The proposed approach replaced misspelled words with English words commonly used in human trafficking advertisements. In addition, missing information related to human trafficking activities was recovered by translating emojis. Unnecessary data was then removed to reduce the processing time of the machine learning step. The experimental results of model prediction using data prepared by the proposed method gave higher accuracy than those obtained from the other models, indicating a significant improvement in human trafficking alerting. The proposed approach will be used
Int J Elec & Comp Eng ISSN: 2088-8708 Improving misspelled word solving for human trafficking detection in online … (Chawit Wiriyakun)

Table 1. Definition of the levels indicating the potential for human trafficking activities labeled by Marinus Analytics

Table 2. Emojis related to human trafficking activities

Table 3. Examples of words and categories from dig-dictionaries used for building the model for detection of human trafficking activities

Table 4. Tag indicator example

Table 6. Example of keys and values for word calculation

Table 7. Example of stop word removal

Table 8. Experiment results showing the percentage of accuracy obtained from each method