Predicting user behavior using data profiling and hidden Markov model

ABSTRACT


INTRODUCTION
Personal data is becoming more difficult to obtain [1], and the most recent legislation allows users to connect to a service without providing personal information. Furthermore, a trustworthy personal profile must be constructed from a variety of attributes that are frequently hidden and can only be deduced using predictive models. Implicit feedback, on the other hand, is easy to collect but highly sparse. Although strategies based on factorization of the user-item matrix [2] are simple to implement (given the ability to use parallelized algorithms), the gap between the number of ratings and the number of products is usually too large to allow reliable estimation.
Various probabilistic methods have been devised by researchers to understand user behavior patterns in social data [3]. These models often incorporate a hidden parameter, representing a user's interest in a particular topic, estimated from their interactions with messages, views, and other related content. User data, such as frequency of visits and the number of messages inspected and received, play a crucial role in recommending services through web applications. Therefore, it is imperative to consider these factors when describing user behavior [4]. Thoughts are sometimes associated with user behavior [5]. We can use innate behavioral information to better comprehend a person's psychological and health behavior. It can also provide a method for predicting future states such as mood, which has a wide range of applications in psychology [6], medicine, and economic transactions [7], where user mood is a major factor in decision-making.
Online user behavior is creating an unprecedented amount of data that is implicitly public. First, the amount of data is expanding as users spend more time on media applications and engage more through postings, comments, and other means [8]. Second, users almost always contribute information. What is less clear, however, is that the capacity to swiftly collect, combine, and analyze that information can reveal extremely detailed information about the user. Individual pieces of information are harmless, but when joined with others, they might lead to unexpected revelations.
Psychological profiling is a technique used by psychologists and law enforcement agencies to understand the behavior and motivation of criminals or suspects [9]. Behavioral profiling, on the other hand, involves the use of machine learning and sophisticated analytics to generate profiles of user behavior in various fields such as marketing, healthcare, and finance. The use of behavioral profiling can help organizations better understand and anticipate user needs. However, it is important to balance the benefits of behavioral profiling with privacy concerns and ethical considerations.
The term "prediction" is on everyone's lips. In the corporate sector, predictive analytics is exploding, and every company wants to know what their customers will do next. Predictive analysis can deliver great business value even when individual predictions are not always accurate [10].
The majority of prediction analyses are based on typical machine learning (ML) techniques such as linear regression, random forest (RF), and natural language processing (NLP). The difficulty with these models is that they take the entire dataset as input and produce a single output column [11]. The goal is to figure out the data pattern. Regrettably, the outcomes of those models differ from one database to the next. This article aims to predict user behavior using data profiling and the hidden Markov model (HMM) technique [12]. In contrast to the models above, an HMM recovers the most likely sequence of states. This paper begins by profiling the dataset to better understand the data's value and identify common user behavior states. Then, to apply the HMM easily, we convert all textual data into a numerical matrix.
The primary goal of our paper is to identify the users' behavioral directions, as well as the states of their behavior and the necessary actions that they must perform. The HMM is one of the most applicable models for this problem. After profiling the data, we can extract information from our database (minimum, maximum, repetition of an activity, and so on). Using this metadata, we can then easily apply the HMM. Thanks to the Daylio application, we obtained a user lifelog dataset that contains every action the user took over time as well as the user's mood at that time. After cleaning and analyzing the dataset, we apply the HMM and predict the user's moods based on their activities.
The remainder of this paper is structured as follows. The first section discusses related work and the significant challenges related to our approach. The second section covers data profiling and the HMM method. The third section details the paper's primary contribution, followed by a discussion of the implementation strategy. Lastly, the study finishes with a thorough discussion of the results and recommendations for potential future research.

BACKGROUND
Lifelogging is the practice of digitally recording a user's everyday activities for a variety of objectives and at various levels of detail [13]. Such data can be saved to track daily activities and improve the human experience. Lifelogging is already being used in a variety of settings. It has, for example, been used to recall human memories [14], anticipate and diagnose physical health issues [15], detect chronic diseases [16], create themed and digitized diaries, record, identify, and review everyday activities, and gather and analyze data on the health of older adults. Researchers in both the lifelogging and physical health fields have shown great interest in studying physical health due to its potential to enhance individuals' quality of life.
Self-disclosure behavior plays an essential role in various social media networks, affecting self-presentation and social desirability. This conduct caters to people's demands and influences their actions and attitudes [17]. Recently, there has been increasing interest in exploring the role of emotions in self-disclosure behavior [18]. By evaluating shareable online encounters from an emotional perspective, a more comprehensive understanding of individuals' capacity to convey, comprehend, manage, and exchange their self-reports to gain social recognition is achievable [19]. According to present studies, online communication facilitates the sharing of opinions, perceptions, emotions, and experiences more than in-person interactions [20].
Our preliminary research findings on comprehending consumers psychologically and tracking their mental health in four psychological categories using lifelogs were published in Dang-Nguyen et al. [21]. These categories consist of measuring the big five personality ("BIG5") traits [22], predicting mood and sleep [23], and detecting music type and mood [24]. We concluded that lifelog data can be used to psychologically analyze and model a person [25].
According to past psychological research, various elements influence a person's behavior. Environmental factors [26], exercise and physical activities [27], [28], weather and air pollution [29], [30], sleep duration and quality, working hours, heart rate, blood pressure, and an individual's personality, particularly in terms of extraversion and neuroticism [31], are some of these factors. Temperature, wind speed, sunshine, precipitation, air pressure, and photoperiod are all factors that affect mood, according to one study by Denissen et al. [29]. Many studies on the relationship between body measurements and mood have shown that biometrics are mood markers and are influenced by mood [32], [33].
Among researchers working in the same direction, some tried to anticipate and identify moods by gathering information from trial users' profiles, such as social participation, gender, linguistic style, and a variety of psychological data [34]. According to this study, user behaviors and postings can be used as behavioral clues to characterize Twitter users' tweets as positive, negative, or neutral [30]. Roshanaei et al. [35], from the same group of researchers, looked at predicting user emotions based on mobile phone habits. To collect valuable and significant amounts of user data, they created an Android app that captures user emotions as well as certain physical data about users' lives, such as the activities they engage in, the places they visit, and the apps they are currently using. In their work, they report correlations between user moods and these characteristics, and they build algorithms that predict user moods with promising accuracy.
ML involves several internal states that are challenging to identify or observe. An alternative approach is to infer these states from observable external factors [36]. This is precisely what HMM does. For instance, in speech recognition, we listen to a spoken utterance (the observable) to deduce the underlying text (the internal state representing the speech).
Our method applies a mathematical approach to historical lifelog data to forecast future user behavior. We can forecast the next and hidden states of user behavior using the HMM. This strategy becomes feasible once all of our textual data has been converted into numerical form.

PROBLEMATIC ISSUE
Several researchers have focused on collecting metadata from web applications to gain a comprehensive understanding of user behavior, which is commonly referred to as data profiling [37]. Studying user behavior can help to enhance the quality of different services and goods [38] and offer them in a way that satisfies consumers' needs. To effectively market their products and services, large corporations are constantly trying to predict and comprehend their clients' behavior. However, determining users' behavioral patterns, their states, and the crucial actions they must take [39] can be challenging. According to our research, the HMM is the most suitable method for addressing this issue. We can extract information from our database after profiling the data (such as minimum, maximum, and activity frequency), and then use this metadata to apply the HMM easily.

PRELIMINARIES
This section focuses on two methods used in this project: data profiling and the HMM approach. Data profiling is used to extract meaningful insights from the given dataset. The HMM approach, on the other hand, is used to build a model that predicts the next possible event based on the previous sequence of events.

Data profiling
Profiling is a procedure that involves gathering data from various data sources (databases and files) and compiling statistics and information on such data [40]. As a result, it is closely related to data analysis [37]. Data profiling is useful in various scenarios where maintaining data quality is crucial. It can help with data warehousing, business intelligence, and linked data by assisting in the identification of potential problems and the implementation of necessary fixes. Moreover, data profiling is crucial for data migration and conversion since it reveals data quality problems that could be overlooked during translation or adaptation.
It may come as a surprise that we engage in more data profiling while cleaning and preparing the data than during the preview stage. During this stage, we often handle, clean, remove, or repair various data elements such as NULL values, errors, missing values, noise, or unexpected data artifacts. Some activities require expertise in a particular field or domain. Examples include converting an attribute to a specific physical unit, generating a new explanatory variable by calculating the ratio of two specific attributes, or converting an IP address to a geographic location [41].
Predicting user behavior using data profiling and hidden Markov model (Bahaa Eddine Elbaghazaoui)

Accurate and effective data science models rely on a comprehensive understanding of the data being analyzed [42]. Therefore, data profiling is considered the most reliable and efficient way to "get to know the data" for analytics and machine learning initiatives [43]. Data profiling involves examining and evaluating data from multiple sources to identify patterns, quality, completeness, and consistency of the data. This process enables data scientists to gain insights into the data and determine the steps required to clean, transform, and prepare the data for modeling.

HMM
The HMM is a modified version of the Markov chain, which is a mathematical model that provides information about the likelihood of sequences of random variables or states, such as words, tags, or symbols representing things such as the weather. The Markov chain assumes that the current state is the only one that matters when predicting future events in a sequence, and that previous states have no influence on the future state. This is like forecasting the weather for tomorrow without considering the weather from yesterday.

$$P(q_i = a \mid q_1, \ldots, q_{i-1}) = P(q_i = a \mid q_{i-1})$$
HMM provides a probabilistic approach to modeling time-series data. Unlike traditional methods, which are based on deterministic mathematical equations, an HMM models a set of discrete states and the transitions between them to capture the underlying patterns in the data. This approach can yield more accurate and robust predictions, as it can capture complex relationships and patterns between the data points. In addition, the use of stochastic models provides a means to model uncertainty and variability in the data, which can be beneficial in many contexts. Finally, HMMs are typically computationally efficient, as they require few parameters and can be scaled up or down to adjust the complexity of the model.
A sequence of state variables, denoted q_i, can be modeled using a Markov model. This model is based on the assumption that the future probability of the sequence depends only on the present state, and not on the past. Thus, the Markov model simplifies the prediction of future states by disregarding the influence of past states.
The HMM is a probabilistic model that uses observable data to infer unobserved information [44]. It is a statistical modeling technique used in speech recognition, handwriting recognition, and other applications [45]. In many circumstances, what is observed is not reality, and the HMM is one method for recovering the hidden truth. However, it is not a universal solution, and to use it, one must satisfy certain assumptions, which are the foundations of the HMM.
In an HMM, a variation of Markov models in which the states generating the observable data are not directly observable, the focus is on the sequence of observations rather than the sequence of states [46]. Given the value of the hidden variable at time t − 1, the conditional probability distribution of the hidden variable at time t is fully determined, and the value of the hidden variable at time t in turn determines the distribution of the observable variable at time t. As a result, HMMs are particularly useful for modeling situations in which a system can only be partially observed.
The HMM considered here assumes a discrete state space for the hidden variables and a continuous Gaussian distribution for the observations. The HMM has two types of parameters: the state transition probabilities and the output (emission) probabilities. Given a hidden state at time t − 1, the state transition probabilities determine the selection of the hidden state at time t. The hidden state takes one of N possible values and is modeled by a categorical distribution. Figure 1 depicts the HMM process, where the current state and the transition probability matrix A determine the Markov process, which is hidden beneath the dashed line.
The higher-level Markov model can be transformed into a hidden Markov model, as depicted in Figure 2, for a better understanding. In this model, the state variables are hidden, and only observable outputs, or emissions, are visible. The model assumes that the probability distribution of each output depends only on the current hidden state, which can be inferred using the observable emissions.
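The Markov assumption and the output assumption described above can be summarized in a single formula. As a sketch, write $\pi$ for the initial state distribution, $A = (a_{ij})$ for the transition matrix, and $b_j(o)$ for the probability that state $j$ emits observation $o$. Apart from the matrix $A$, this notation is introduced here for illustration and is not taken from the original text. The joint probability of a hidden state sequence $q_1, \ldots, q_T$ and an observation sequence $o_1, \ldots, o_T$ then factorizes as:

```latex
P(o_1, \ldots, o_T, q_1, \ldots, q_T)
  = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)
```

Each factor involves only one transition or one emission, which is what makes the dynamic programming algorithms for HMMs tractable.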
In Figure 2, the Markov chain model is clearly visible: the "start" state, the two states "state 1" and "state 2", as well as the transitions and the corresponding probabilities. Two observations have been added: "A" and "B". Each state has a given probability of emitting each observation, which we call the "probability of emission". This probability could be zero, indicating that the state is unable to emit the given observation. In our example, state 1 has a 50% probability (0.5) of producing an "A" and a 50% probability (0.5) of producing a "B", whereas state 2 has a 90% probability (0.9) of producing an "A" and a 10% probability (0.1) of producing a "B". The sum of a state's emission probabilities must always be equal to one, just as it must be for the transitions leaving that state. The basic components of the HMM are: i) hidden states, ii) observation symbols and the initial distribution over hidden states, iii) the probability distribution of transitions to the terminal state, iv) the distribution of state transition probabilities, and v) the distribution of state emission probabilities.
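The two-state example can be written down directly as plain dictionaries, which makes the "rows must sum to one" constraint easy to check. This is a minimal sketch: the emission values come from the text, whereas the start and transition values appear only in Figure 2, so the numbers used for them below are placeholders chosen for illustration. The helper names are hypothetical.

```python
# Emission probabilities stated in the text for the two-state example.
emission = {
    "state 1": {"A": 0.5, "B": 0.5},
    "state 2": {"A": 0.9, "B": 0.1},
}

# Assumption: start and transition values are not given in the text
# (they appear only in Figure 2), so these numbers are placeholders.
start = {"state 1": 0.5, "state 2": 0.5}
transition = {
    "state 1": {"state 1": 0.7, "state 2": 0.3},
    "state 2": {"state 1": 0.4, "state 2": 0.6},
}


def is_distribution(row, tol=1e-9):
    """Return True if all values are non-negative and sum to one."""
    return all(p >= 0 for p in row.values()) and abs(sum(row.values()) - 1.0) < tol


def validate_hmm(start, transition, emission):
    """Check that the start vector and every matrix row is a distribution."""
    rows = [start, *transition.values(), *emission.values()]
    return all(is_distribution(row) for row in rows)


model_is_valid = validate_hmm(start, transition, emission)
```

Running such a check before decoding catches the most common data-entry mistake, a row whose probabilities do not add up to one.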
An HMM is divided into two parts: hidden and observable. The hidden part is made up of hidden states that cannot be seen directly but are detected through the observation symbols that the hidden states emit. For example, we do not know what mood our user is in (mood is a hidden state), but we may see their actions (observable symbols) and infer from those actions which hidden state they are in.
There are three common problems that an HMM can be used to solve: i) calculating the probability of a specific observation sequence given the model (using the forward algorithm); ii) finding the most likely sequence of hidden states that led to the generation of a given output sequence (using the Viterbi algorithm); and iii) given an output sequence, determining the most likely set of state transition probabilities as well as the output probabilities of each state. For this third problem, the Baum-Welch algorithm, which relies on the forward-backward procedure, is used.
Let us take the on-screen keyboard of a mobile phone as an example [47]. We may occasionally mistype the character next to the one we meant to type. The observed data is the mistyped character, while the unobserved data is the character we intended to enter. Another example is that, owing to random noise, global positioning system (GPS) measurements (observed) may jump around the actual location (unobserved).
A hidden Markov model can be evaluated using one of two approaches. a) Likelihood of test data: in this strategy, some test data is held out and the likelihood of the test sequences is computed using the forward algorithm. b) Predicting parts of the data from other parts: whether a prediction task is meaningful depends on the application. One might be interested in forecasting the future, for example. In this scenario, the forward algorithm can be used to track the state of the HMM.
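The likelihood-based evaluation in a) can be sketched with a small pure-Python forward algorithm. This is an illustration rather than the implementation used in this work: it handles discrete observations only, the start and transition values are placeholders (the text gives only the emission values of the two-state example), and it does not rescale probabilities, so it would underflow on long sequences.

```python
import math


def forward_log_likelihood(obs, states, start, trans, emit):
    """Return log P(obs | model) computed with the forward algorithm."""
    # alpha[s] holds P(o_1..o_t, q_t = s) for the current time step t.
    alpha = {s: start[s] * emit[s].get(obs[0], 0.0) for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s].get(symbol, 0.0) * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    total = sum(alpha.values())
    return math.log(total) if total > 0 else float("-inf")


# Emission values from the two-state example; start and transition values
# are placeholders (assumption), since the text does not state them.
states = ["state 1", "state 2"]
start = {"state 1": 0.5, "state 2": 0.5}
trans = {
    "state 1": {"state 1": 0.7, "state 2": 0.3},
    "state 2": {"state 1": 0.4, "state 2": 0.6},
}
emit = {
    "state 1": {"A": 0.5, "B": 0.5},
    "state 2": {"A": 0.9, "B": 0.1},
}

log_likelihood = forward_log_likelihood(["A", "B", "A"], states, start, trans, emit)
```

A held-out sequence that the model explains well receives a higher log-likelihood than one it explains poorly, which is the basis of approach a).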

METHOD AND CONTRIBUTION
This study presents our workflow approach, which is illustrated in Figure 3. The initial stage is to gather information about user behavior and activities. In the second step, we use data cleaning to identify relevant and powerful data, followed by data profiling to gather metadata for our HMM model. The final step is to apply the HMM to our metadata.
To make things clearer: in the first phase, we must choose a dataset or an application programming interface (API) that contains information about user behavior as well as daily activities. These details make it easier to put our strategy into action. In the second phase, we must clean the database of incorrect, incomplete, and null information [48]. Then, for each state, we calculate the average, minimum, and maximum values to get a broad picture of the data. In the final phase, we calculate the state transition matrix and the emission matrix between each state and activity. The initial probability of each state can be taken as 1 divided by the number of states. Finally, we apply our HMM model.

ISSN: 2088-8708

Figure 3. Approach workflow
Our task corresponds to the second problem described in the HMM preliminaries. The goal of the Viterbi algorithm is to draw an inference based on some observed data and a trained model [49], [50]. It works by posing the following question: given the trained parameter matrices and the data, what are the states that maximize the joint probability? In other words, given the data and the trained model, what is the most likely sequence of states? Algorithm 1 expresses this procedure, and its answer depends on the observed data. It stores the most likely sequence of states in an array Q[1..T].
Algorithm 1 is a dynamic programming method used to identify the sequence of states with the highest probability in an HMM, given a sequence of observations. To achieve this, the algorithm initializes a two-dimensional array called V, which stores the probabilities of the most likely paths. Then, for each state, the algorithm assigns the probability of the first observation. Next, the algorithm iterates over each time step and, for each state, finds the maximum probability of the most likely sequence of states given the probabilities from the previous step. Finally, the algorithm uses the entry of the array with the maximum probability to trace back and return the most probable sequence of states.
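The dynamic programming procedure described above can be sketched in a few lines of Python. This is an illustrative version for discrete observations, not a transcription of Algorithm 1: the two moods, the two activities, and every probability in the toy model are hypothetical values chosen so that the result can be checked by hand.

```python
def viterbi(obs, states, start, trans, emit):
    """Return (best_path, best_prob) for the given observation sequence."""
    # V[t][s]: probability of the most likely path that ends in state s at time t.
    V = [{s: start[s] * emit[s].get(obs[0], 0.0) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that most likely path
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev, prob = max(
                ((r, V[t - 1][r] * trans[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            V[t][s] = prob * emit[s].get(obs[t], 0.0)
            back[t][s] = prev
    # Pick the best final state, then follow the back-pointers to the start.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, V[-1][last]


# Hypothetical toy model: moods are hidden states, activities are observations.
states = ["Good", "Bad"]
start = {"Good": 0.5, "Bad": 0.5}
trans = {"Good": {"Good": 0.8, "Bad": 0.2}, "Bad": {"Good": 0.4, "Bad": 0.6}}
emit = {
    "Good": {"reading": 0.7, "youtube": 0.3},
    "Bad": {"reading": 0.2, "youtube": 0.8},
}

best_path, best_prob = viterbi(["reading", "youtube", "youtube"], states, start, trans, emit)
```

With these toy values, the decoded path is Good, Bad, Bad with joint probability 0.02688. For long sequences, a practical implementation would work with log-probabilities to avoid numerical underflow.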
The quality of fit of an HMM can be assessed in Python by calculating the model log-likelihood. The log-likelihood is a measure of how well the model fits the data. To calculate the log-likelihood of an HMM, the model must first be fitted to the data using the fit() method, which estimates the model parameters. Then, the score() method can be used to compute the log-likelihood. The better the model fits the data, the higher the log-likelihood.

IMPLEMENTATION AND RESULTS
Our implementation follows the workflow presented in our contribution. First, we need to gather a set of user-relevant data from a database or an API. After searching, we found a dataset collected by the Daylio application that is tailored to our needs. The dataset can be found on the Kaggle website [51]. It is Abid Ali Awan's lifelogs with goals, and it comprises full data about time, emotions, and activities that affect mood from 03/02/2018 to 16/04/2021, as represented in Table 1.
The dataset has 940 rows. It has many columns, as shown in Table 1, but the last two columns (activities and mood) are the ones of interest for our approach. The user "Abid Ali Awan" frequently alternates between five moods (good, normal, amazing, awful, and bad), and there are numerous activities to choose from, such as reading, learning, and praying.
Following the data quality step, we applied a Python script to clean the dataset. We keep only the columns that we are interested in and delete the rows that have null or empty values. After that, we started data profiling. We used the pandas-profiling library to get a description of metadata such as the minimum, maximum, and most frequent values, as seen in Table 2. Data profiling in our work is based on extracting metadata that will help us predict emotional states based on activities. To accomplish this, we must define the transition matrix between states (between moods) as well as the emission matrix between states and symbols (moods and activities).
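The cleaning and profiling steps can be illustrated with a small standard-library sketch. It is not the script used in this work, which relied on pandas-profiling: the function names are hypothetical, the column names "mood" and "activities" follow the description of the dataset, and the "|" separator between the activities of one day is an assumption about the export format.

```python
from collections import Counter


def clean_rows(rows):
    """Keep only the two columns of interest and drop rows with missing values."""
    cleaned = []
    for row in rows:
        mood = (row.get("mood") or "").strip()
        activities = (row.get("activities") or "").strip()
        if mood and activities:
            cleaned.append({"mood": mood, "activities": activities})
    return cleaned


def profile(rows, sep="|"):
    """Return simple metadata: the row count and value frequencies per column."""
    moods = Counter(row["mood"] for row in rows)
    activities = Counter(
        name.strip()
        for row in rows
        for name in row["activities"].split(sep)
        if name.strip()
    )
    return {"rows": len(rows), "moods": moods, "activities": activities}


# Hypothetical sample rows, including two that the cleaning step should drop.
sample = [
    {"mood": "Good", "activities": "reading | learning"},
    {"mood": "", "activities": "reading"},
    {"mood": "Bad", "activities": None},
    {"mood": "Good", "activities": "reading"},
]

cleaned = clean_rows(sample)
report = profile(cleaned)
```

The frequency counters returned by `profile` are the raw material for the transition and emission matrices.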
We now turn to the transition matrix. A "transition" is a move from one specific state to another. For instance, a transition between "Good" and "Bad" is a day on which the user is "Good" followed by a day on which the user is "Bad". The transition value between these two states is the number of transitions between them divided by the total number of transitions leaving the first state, so that the values for each state sum to one. Table 3 shows the values of the transition matrix in our dataset. Next, we must generate the emission matrix, which contains, for each state, the probability of emitting each of the potential observations. For example, for the state "Good" and the symbol "designing", the emission value is the number of days on which the user is "Good" and performs the activity "designing", divided by the number of days on which the user is "Good". After extracting all the necessary data, we used Python, our preferred language, and the hmmlearn library with Gaussian mixture emissions. hmmlearn is a set of algorithms for unsupervised learning and inference of HMMs.
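The counting rules for the two matrices can be sketched as follows. This is an illustration under stated assumptions, not the original script: the function names and the sample days are hypothetical, and "the total number of transitions" is interpreted as the transitions leaving the first state, so that each row of the transition matrix sums to one. The uniform initial distribution follows the 1 / (number of states) rule from the method section.

```python
from collections import Counter, defaultdict


def initial_distribution(states):
    """Uniform start probabilities: 1 divided by the number of states."""
    return {s: 1.0 / len(states) for s in states}


def transition_matrix(moods):
    """Estimate P(next mood | current mood) from consecutive days."""
    counts = defaultdict(Counter)
    for today, tomorrow in zip(moods, moods[1:]):
        counts[today][tomorrow] += 1
    matrix = {}
    for mood, row in counts.items():
        total = sum(row.values())  # transitions leaving this mood
        matrix[mood] = {nxt: c / total for nxt, c in row.items()}
    return matrix


def emission_matrix(days):
    """days: list of (mood, activities) pairs, one pair per day."""
    mood_days = Counter(mood for mood, _ in days)
    joint = defaultdict(Counter)
    for mood, activities in days:
        for activity in set(activities):  # count each activity once per day
            joint[mood][activity] += 1
    return {
        mood: {a: c / mood_days[mood] for a, c in row.items()}
        for mood, row in joint.items()
    }


# Hypothetical four-day log used only to exercise the functions.
days = [
    ("Good", {"designing"}),
    ("Good", {"reading"}),
    ("Bad", {"youtube"}),
    ("Good", {"designing", "reading"}),
]
T = transition_matrix([mood for mood, _ in days])
E = emission_matrix(days)
pi = initial_distribution(["Good", "Bad"])
```

Note that when a day contains several activities, the emission values of a state computed this way need not sum to one; they would have to be renormalized before being used as a categorical emission distribution.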

DISCUSSION
The following experiment serves as an application. We give our program lists of activities and ask: if a user performs these activities, which mood states does the program predict? In the first tests, we tried random activities that were different and not repeated. According to our tests, if we repeat some of the user's activities, the predicted mood states repeat too (for example, if the user watches YouTube several times, the predicted mood state is "Awful"). We therefore find that user activities influence the future mood state. This is consistent with the HMM hypothesis and supports the results of our program.
The evaluation of our system can also be based on the likelihood results. The log-likelihood value of a model is a measure of how well the model fits the data. A higher log-likelihood value indicates a better fit to the dataset. The log-likelihood is useful when comparing multiple models. In practice, several models are often fitted to the same dataset, and the model with the highest log-likelihood value is chosen as the best fit.
Fitting a model with 34 free scalar parameters to just 5 data points results in a degenerate solution, with a log-likelihood of 165.2432. If we take 9 activities from our dataset, we get a log-likelihood of 14.3623. If we take 100 activities from our dataset, we get a log-likelihood of -238.5843.
The results indicate that the model is overfitted when using a small dataset of 5 data points with 34 free scalar parameters: the log-likelihood is much higher than with the larger samples of 9 and 100 activities, which is the signature of a degenerate fit rather than of a better model. The log-likelihood also decreases from 9 to 100 activities. Because the log-likelihood sums over all observations, its magnitude grows with the length of the sequence, so these raw values should not be compared directly across sample sizes. What the results do suggest is that the parameter estimates become more reliable as more data points are available.
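Since a log-likelihood is a sum over observations, one common way to put sequences of different lengths on a comparable scale is to divide by the number of observations. This normalization is not part of the original analysis; the short sketch below simply applies it to the three values reported above.

```python
# Reported log-likelihoods, keyed by the number of data points used.
reported = {5: 165.2432, 9: 14.3623, 100: -238.5843}

# Average log-likelihood per observation (a normalization assumed here
# for illustration; it is not part of the original evaluation).
per_observation = {n: value / n for n, value in reported.items()}

for n, value in per_observation.items():
    print(f"{n:>3} data points: {value:.4f} per observation")
```

Even after normalization, the 5-point fit stands out with a far larger per-observation value than the other two, which is consistent with the degenerate solution noted for that case.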

CONCLUSION AND FUTURE DIRECTION
HMM has the ability to capture the underlying structure of user behavior, allowing for more accurate predictions. It has been used with great success in a variety of applications, from predicting user behavior on websites to predicting stock market movements. By leveraging the power of HMM, we can better understand user behavior, predict the user's next state, and ultimately improve the user experience. The objective of this project is to combine several basic tasks: data collection and storage, state and symbol profiling from our data, and the extraction of the transition matrix and emission matrix. Then, using the HMM, we forecast hidden states based on observable symbols. This method will help us comprehend and anticipate customer behavior, so that the product can be tailored to meet their needs.
In our future directions, we will explore the potential of combining HMMs with machine learning techniques to create more efficient and powerful solutions. We will investigate the use of HMMs to improve the accuracy of machine learning algorithms, as well as explore their potential to enable more intelligent and dynamic machine learning models. We will also investigate the potential of HMMs to act as an intermediary between machine learning algorithms, potentially allowing for more efficient and interactive learning systems.

Table 3. Transition matrix result