Prediction of addiction to drugs and alcohol using machine learning: A case study on Bangladeshi population

ABSTRACT


INTRODUCTION
Drug addiction, the illegal taking of various drugs and dependence on their toxic and addictive effects, is one of the most malignant problems a country can face. It can easily destroy a life and a nation. In a developing country like Bangladesh, addiction has a terrible effect on society. According to a report in The Daily Star, Bangladesh has become increasingly involved with terrorist groups engaged in drug abuse and production, which use Bangladesh's territory to smuggle drugs and thereby threaten the country's youth [1]. About 2.5 million (25 lac) people are drug-addicted, and about 80 percent of the drug addicts in Bangladesh are adolescents and young men of 15 to 40 years of age [2]. Dissatisfaction is the root of this addiction: joblessness, political upheaval, and the absence of family ties, love, and affection give rise to frustration. The only way to reduce the risk of becoming addicted is to stay away from drugs before addiction sets in. Nowadays drug addiction has become a danger that silently affects the young generation from all walks of life. In 2015, drug-addicted Oishee Rahman killed her parents [3]. It is even difficult for a woman to move around the city alone, because many drug-addicted people roam freely there, and in a new place we cannot tell who is addicted. An addicted friend can easily destroy a friend circle. According to the Dhaka Tribune, there are around 7.5 million people addicted to drugs in Bangladesh; alarmingly, 80% of them are youth and 50% of them are involved in different criminal activities [4]. We need to keep a special focus so that our youth do not become addicted to drugs.
Machine learning, a major branch of artificial intelligence (AI), can provide a solution to the problem discussed above. Its applications span many domains, e.g. cancer prediction [5], software fault prediction [6], dermatological disease detection [7], and risk prediction [8]. Likewise, different prominent machine learning algorithms can be put to use for predicting addiction to drugs and alcohol.
This paper tries to anticipate in advance whether someone is at risk of becoming addicted to drugs and alcohol. First, we read relevant articles from different national and international journals, conference proceedings, and magazines, as well as write-ups from different websites and newspapers. Then we talked to doctors and drug-and-alcohol-addicted people and identified some driving factors for addiction, such as age, gender, profession, health, mental pressure, trauma, family and friends' history, and life-changing incidents, and we collected raw data from both addicted and non-addicted people. We have endeavored to compare our results with those of similar research works, even though no prior work has been observed that addresses the prediction of addiction to drugs and alcohol.
We have studied related works done in the recent past by other researchers on drug and addiction prediction and examined the processes and methods they used. Here are some notable recent research works involving machine learning. Dahiwade et al. [9] proposed a general disease prediction system based on machine learning algorithms. Hegazy et al. [10] proposed a model for stock market prediction with machine learning technology. Alonzo et al. [11] presented a detailed comparison between various machine learning algorithms used for the prediction and assessment of coconut sugar quality. Haghiabi et al. [12] worked on predicting water quality with a machine learning approach. Zhang et al. [13] proposed a method for predicting daily smoking behavior based on a machine-learning algorithm; they used the extreme gradient boosting (XGBoost) decision tree algorithm and found the best accuracy of 84.11% with a maximum depth of five. Alaa et al. [14] proposed a machine learning-based model for predicting cardiovascular disease risk on Biobank participants. Zhu et al. [15] worked on pre-symptomatic detection of tobacco disease with hyperspectral images and machine-learning classifiers.
Zhang et al. [16] worked on predicting human immunodeficiency virus (HIV) prognosis and mortality with smoking-associated deoxyribonucleic acid (DNA) and machine learning classifiers. Granero et al. [17] proposed a model for predicting exacerbations of obstructive pulmonary disease with machine learning features. Frank, Habach, and Seetan [18] worked on smoking status prediction with machine learning and statistical analysis; logistic regression had the best performance in their work, with 83.44% accuracy, 83% precision, 83.4% recall, and 83.2% F-measure. Lee et al. [19] worked with a model that predicts alcohol use disorder by checking treatment-seeking status with a machine learning classifier; their collected data domains were cognition, mood, impulsivity, personality, aggression, early life stress, and childhood trauma. Kinreich et al. [20] proposed a model for predicting the risk of alcohol use disorder (AUD) using machine-learning technology. Kumari et al. [21] proposed a model for predicting alcohol abuse using machine learning technology; they considered age, gender, country, ethnicity, education, neuroticism, openness to experience, extraversion, agreeableness, conscientiousness, impulsivity, and sensation seeking as their model's features. These features were used in ANN-D, while day, week, month, year, and decade were used in ANN-C. Habib et al. [22] did a study on papaya disease recognition based on a machine learning classification technique. This paper is organized as follows: Section 1 describes the introduction, Section 2 gives a short review of the research method, Section 3 explains the results and discussion, and Section 4 contains the conclusion.

RESEARCH METHOD
The system architecture of the prediction of addiction to drugs and alcohol is demonstrated in Figure 1. A user answers the questions through a web application. The information collected from the user goes to the server and from there to the expert system, where the outcome is determined by applying a logistic regression algorithm to the processed input. The result, after specific formatting, can be viewed through the web application. We have collected data from 510 people, both addicted and non-addicted; 80% of the data has been treated as training data and 20% as test data. Our data collection and preprocessing techniques are described in the next section. We have used the nine machine-learning algorithms mentioned earlier and calculated the accuracy three times: first on the processed data before using principal component analysis (PCA), then on the processed data after using PCA, and finally on the unprocessed data. We have evaluated the classifiers based on accuracy and other metrics such as sensitivity, specificity, precision, recall, and F1-score. This working process is described in the flow diagram in Figure 2. We have run the nine machine-learning algorithms on the processed data set, where the number of features was 23. We have then used PCA, a feature extraction method that captures the underlying variance of the data in orthogonal linear projections. The number of independent variables used in a model is known as its dimensionality; PCA reduces the number of variables so that only the important ones are carried forward to the next task. Figure 3 shows the scree plot, with the explained variance on the y-axis and the number of features on the x-axis.
Using the scree plot and 90% explained variance as a threshold, we have determined the number of principal components to be 14. PCA essentially combines highly correlated variables into a smaller artificial set of variables [23].
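This component selection can be sketched with plain NumPy on a synthetic stand-in for the 510 × 23 processed matrix (the random data here is an illustrative assumption, not the actual study data; only the 90%-variance rule mirrors the text):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 510 x 23 processed feature matrix
X = rng.normal(size=(510, 23)) @ rng.normal(size=(23, 23))

# Centre the data and take the eigenvalues of the covariance matrix,
# sorted from largest to smallest (these are the scree-plot heights)
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

# Cumulative explained-variance ratio
explained = np.cumsum(eigvals) / eigvals.sum()

# Smallest number of components whose cumulative variance reaches 90%
n_components = int(np.searchsorted(explained, 0.90) + 1)
```

On the real processed data set this rule yielded 14 components; on other data the count will differ.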
k-NN is a simple supervised machine-learning algorithm that assumes that similar instances lie close to one another [24]. The Minkowski distance between the query point and the other points is determined using (1).
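Equation (1) did not survive in this copy of the text; the Minkowski distance it refers to has the standard form (p = 1 gives the Manhattan distance and p = 2 the Euclidean distance):

```latex
d(\mathbf{x}, \mathbf{q}) = \left( \sum_{i=1}^{n} \lvert x_i - q_i \rvert^{p} \right)^{1/p}
```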
Support vector machine (SVM) is a supervised machine-learning algorithm. Data items are placed in n-dimensional space, where the feature values represent the coordinates [24]. SVM builds a maximum-margin separator, i.e. a decision boundary with the largest possible distance to the data points. Here W is the weight vector and X is the set of points. Using (2), we can find the separator.
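Equation (2) is likewise missing from this copy; with weight vector \(\mathbf{w}\), bias \(b\), and input point \(\mathbf{x}\), the maximum-margin separator takes the standard form:

```latex
\mathbf{w} \cdot \mathbf{x} + b = 0
```

A new point is then classified by the sign of \(\mathbf{w} \cdot \mathbf{x} + b\).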
Logistic regression uses the logistic function, which serves as a sigmoid function: an S-shaped curve that takes real values and maps them between 0 and 1 [24]. The logistic function is given as (3).
Naïve Bayes is one of the oldest machine-learning algorithms. It is based on Bayes' theorem and basic statistics, and it models continuous attributes using the Gaussian distribution [23]. The Gaussian distribution with mean and standard deviation is described in (4).
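Equations (3) and (4) are not reproduced in this copy; the standard logistic (sigmoid) function and the Gaussian density they refer to are:

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
g(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\,
\exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```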
MLP stands for multilayer perceptron, a network of multiple layers of neurons. The first layer is the input layer, the second is called the hidden layer, and the third is called the output layer; data enters through the input layer and the prediction comes out of the output layer [24]. CART is a distribution-free decision tree learning technique. A decision tree is a tree-based model built with the divide-and-conquer method. CART uses the Gini index, which measures the impurity of D, the set of training tuples [23]. The Gini index is defined as below. Freund and Schapire proposed AdaBoost in 1996. It builds a classifier as a combination of multiple poorly performing classifiers, setting the weight of each classifier and re-training on the data in each iteration [23]. Using (6), we can compute the error rate of a classifier over the training tuples.
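The Gini index and AdaBoost error-rate equations are also missing from this copy; their standard textbook forms, with \(p_i\) the proportion of class \(i\) in D, and \(w_j\) and \(err(X_j)\) the weight and misclassification indicator of tuple \(X_j\) for classifier \(M_i\), are:

```latex
Gini(D) = 1 - \sum_{i=1}^{m} p_i^{2}, \qquad
error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)
```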
Random forest builds a large collection of de-correlated trees for prediction. It performs split-variable randomization, so each tree split searches over a smaller feature space [23]. Gradient boosting machine (GBM) builds an ensemble of shallow trees, each one learning from and improving on the previous. GBM boosts weak learners iteratively by shifting focus toward problematic observations; like other boosting methods, it builds the model in a stage-wise fashion and generalizes it by optimizing an arbitrary differentiable loss function [25]. Besides accuracy, we have calculated the sensitivity, specificity, precision, recall, F1-score, ROC curve, and confusion matrix of each algorithm. For model evaluation, the classifiers have been measured on the test data set. Sensitivity is the true positive rate: the proportion of positive tuples that are correctly diagnosed. The driving factors considered in this work are listed in Table 1.
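As a concrete sketch, these evaluation metrics can be computed directly from a confusion matrix (the labels below are toy values, not the study's actual test split):

```python
import numpy as np

# Toy labels standing in for a test split (1 = addicted, 0 = not addicted)
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 1, 0, 0, 0, 1, 1, 0, 1, 0])

# Confusion-matrix cells
tp = int(np.sum((y_pred == 1) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

sensitivity = tp / (tp + fn)   # recall / true positive rate
specificity = tn / (tn + fp)   # true negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
accuracy = (tp + tn) / len(y_true)
```

The same quantities are what scikit-learn's metric functions report; computing them by hand makes the definitions in the text explicit.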
To identify the risk of becoming addicted to drugs, we have considered each of these factors, which we identified by talking to various physicians and consulting websites [26]-[30] and articles. A data set is an organized collection of records that can easily be accessed and updated. People around us who take drugs do so in secret, and drug addicts at train and bus stations refused to help us, so we decided to go to drug addiction centers and rehabilitation centers. We collected information from some private rehabilitation centers and clinics; New Mukti Clinic [31] and Brain and Mind Hospital [32] helped us with information, and besides providing it, their consultants and doctors taught us about many more important factors. Thus, we were able to collect data of 510 people based on 23 factors: 305 records from drug addicts and 205 from healthy people. We have also collected data from Daffodil International University, Sylhet Engineering College, Begum Rokeya University, New Mukti Clinic, Brain and Mind Hospital, and some other places. Data collection was the most challenging task for us. Nevertheless, we managed to collect the data, which contained some missing values as well as categorical, numerical, and text data. We decided to make this data suitable for the different algorithms through preprocessing. Figure 4 describes our data preprocessing work, which started with data cleaning.
Table 1. Driving factors considered for predicting the risk of addiction
- How much you care about yourself [27]
- Think that drug addiction can be a solution [28]
- Age [29]
- Job losing [30]
- Losing weight [27]
- Residing address [30]
- Sexual harassment [27]
- Have addicted friends [29]
- Profession [30]
- Gender [29]
- Reason to become addicted [29]
- Distance with friends and family [28]
- Having odd sleep pattern [30]
- An addicted person at home [26]
- Working efficiency [29]
- Faced any trauma [26]
- Stress controlling skills [30]
- Relationship problem [26]
- Economic status [29]

We have checked whether there are null values in the data set. We have then applied label encoding to convert the text data to numerical data, and solved the missing-value problem using a median imputer. Next, we have checked for noisy values using a box plot and found some in our data set: the 'age' feature contained noisy values, which we handled with quantile-based outlier treatment. We have then separated out the outcome feature, the 'addicted' column. This gave us the final processed data set.
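The cleaning steps above can be sketched with pandas on a toy frame (the column names and values here are illustrative assumptions; the real data set has 23 features and 510 rows):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data
df = pd.DataFrame({
    "age": [18, 22, 25, 130, np.nan, 30],   # 130 is a noisy value, NaN is missing
    "gender": ["male", "female", "male", "female", "male", "female"],
})

# 1) Missing values: fill numeric gaps with the column median
df["age"] = df["age"].fillna(df["age"].median())

# 2) Label encoding: map text categories to integer codes
df["gender"] = df["gender"].astype("category").cat.codes

# 3) Noisy values: clip 'age' to a quantile range (outlier treatment)
lo, hi = df["age"].quantile([0.05, 0.95])
df["age"] = df["age"].clip(lo, hi)

# 4) Separate the outcome column before training (the real set has an
#    'addicted' column; this toy frame contains only predictors)
X = df.copy()
```

The exact quantile bounds used in the study are not stated in the text; the 5th and 95th percentiles here are a placeholder choice.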

RESULTS AND DISCUSSION
This section discusses the results of our research work in detail. For ease of understanding, we present our data with the help of some graphs and tables in two subsections, and provide a brief comparison with the work of others as well.

Experimental evaluation
A data set was prepared by gathering 510 people's data. The statistics show that 209 people became addicted because of their friends and 98 out of curiosity. Table 2 shows the correlation between the features: a positive value means two features are positively correlated, a negative value means they are negatively correlated, and zero indicates no correlation. It also shows how the features correlate with the outcome. Table 3 describes the performance of each algorithm in terms of sensitivity, specificity, recall, precision, and F1-score. Based on this performance, we determined which algorithm fits our problem domain best. Logistic regression performs best based on accuracy. Based on sensitivity, specificity, recall, and precision, CART performs better; however, CART's performance on the unprocessed data and after PCA was not good. So, considering everything, the best-performing model was obtained with the logistic regression algorithm. We have used nine algorithms here, each with certain parameters whose values vary. The parameter values used for training the models are listed in Table 4; the values given there were selected as optimal by experiment. Before using PCA, k-NN achieved 96.8% accuracy, SVM 93.75%, logistic regression 84.37%, naïve Bayes 87.5%, random forest 66.67%, CART 50%, AdaBoost 69.79%, MLP 78.13%, and GBM 73.96%. After using PCA, the accuracy of some algorithms increased, some decreased, and some remained unchanged.
After PCA, k-NN achieved 82.29% accuracy, SVM 95.83%, logistic regression 97.91%, naïve Bayes 92.7%, random forest 73.95%, CART 59.37%, AdaBoost 71.87%, MLP 72.91%, and GBM 59.38%. The difference in accuracy obtained before and after the use of PCA is shown in Figure 5. We have also calculated the accuracy on the unprocessed data set: k-NN achieved 81.37% accuracy, SVM 59.01%, logistic regression 58.82%, naïve Bayes 57.84%, random forest 73.52%, CART 57.84%, AdaBoost 71.56%, MLP 58.82%, and GBM 73.52%.

Comparative analysis of result
To evaluate the goodness of our proposed addiction prediction system, we need to compare our work with some recent and relevant research works. It should be taken into account that the assumptions other researchers adopted in collecting their samples, and in processing those samples and reporting results, strongly affect any comparative performance evaluation. We have strived to compare our work with others' based on some common parameters; Table 5 shows a comparative overview of other works and our work.
Zhang et al. [13] performed a prediction on daily smoking behavior with five features after collecting data from 15,095 people. Zhu et al. [15] worked on tobacco disease detection with 180 hyperspectral images and 32 features. In [16], prediction of HIV prognosis and mortality with smoking-associated DNA was done with roughly 0.78 AUC. Prediction of smoking status from patients' blood tests and health-associated vital readings was done in [18]. Lee et al. [19] predicted alcohol use disorder by checking the treatment-seeking status of patients, but they did not mention the accuracy of their work. In [20], the risk of alcohol use disorder was also predicted from different types of data, yet the classifier and accuracy were not mentioned. Prediction of alcohol abuse with an ANN was seen in the work of Kumari et al. [21], which showed an accuracy of 98.7%. Concerning the overall picture depicted in this section, our attained accuracy of 97.91% has turned out to be good as well as promising. The reason our proposed solution achieves such high accuracy is that the features deployed are computationally simple to calculate and carry highly discriminatory information for predicting the risk of becoming addicted to drugs. As mentioned before, most of the other works are not very close to ours, so it would not be wise to explicitly compare the worthiness of our approach with theirs.

CONCLUSION
In this paper, we have performed in-depth exploratory work on predicting the risk of becoming addicted to drugs and alcohol using different machine learning techniques. First, we formed the basis for this predictive work, i.e. the feature set, after talking to doctors and drug-and-alcohol-addicted people and reading different articles and write-ups. Data were then collected and thoroughly preprocessed. The prediction of the risk of addiction to drugs and alcohol has been accomplished with nine prominent classifiers, whose merits have been measured in terms of six well-known performance metrics. The relative merits of the results achieved have then been assessed by analyzing the results of similar works. We have achieved an accuracy of 97.91% with the logistic regression classifier, which is good as well as promising. A potential future work is to use a much larger set of addicted and non-addicted people's data in order to cover as wide a range of addicted and non-addicted people as required for Bangladesh.