Reinforcement learning strategies using Monte-Carlo to solve the blackjack problem

ABSTRACT


INTRODUCTION
The experimental version of blackjack is played as follows: at the beginning of each round, the agent and the dealer each receive two cards from a deck of 52 cards. The value of the dealer's first card and the total value of the agent's hand form the features from which the state space is constructed. The agent then selects an action from the action space, A(s) = {hit, stick}: hitting draws a new card for the agent's hand, and sticking submits the existing hand for scoring. Within the rules of the game, the agent is encouraged to obtain the maximum score possible.
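The scoring of a hand in the game described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual blackjack convention that face cards count as 10 and an ace counts as 1 or 11, and the names `hand_value`, `is_bust`, and `ACTIONS` are hypothetical.

```python
def hand_value(cards):
    """Return (total, usable_ace) for a list of card values, with aces given as 1.

    Assumption: face cards are already represented by the value 10.
    """
    total = sum(cards)
    # An ace is "usable" if counting it as 11 instead of 1 does not bust the hand.
    usable_ace = 1 in cards and total + 10 <= 21
    return (total + 10 if usable_ace else total), usable_ace


def is_bust(cards):
    """A hand is bust when its best total exceeds 21."""
    return hand_value(cards)[0] > 21


ACTIONS = ("hit", "stick")  # the two actions described in the text
```

For example, `hand_value([1, 6])` treats the ace as 11 and reports a soft 17, while `hand_value([1, 6, 10])` must count the ace as 1 and reports a hard 17.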
Reinforcement learning (RL) techniques rely on feedback from the environment to help the learner improve. The agent is guided in formulating its policy by this feedback, which comes as a monetary reward signal. A Markov decision process (MDP) is typically used to represent the environment. An MDP comprises states, actions, transition probabilities, and expected rewards. Each action has a probability of being chosen and a value linked to it that reflects the expected benefit of taking the action. The action with the highest estimated value is the greedy action. To learn, the agent must strike a balance between exploring the environment and exploiting what it already knows. During exploration, the agent tries non-greedy actions to improve its estimates of their values. The agent updates the state values and the state-action values independently under each of these methods. We then present the results of applying each technique in terms of the win, draw, and loss percentages for each game regime.

Int J Elec & Comp Eng, ISSN: 2088-8708. Reinforcement learning strategies using monte-carlo to solve the blackjack … (Raghavendra Srinivasaiah), p. 905

RL provides the framework for formalizing this optimization problem mathematically. The Bellman equation links the values of states recursively across discrete time steps t. Value functions allow the agent to determine the value of being in a state s, or of taking an action in s, while following a specified policy. These value functions are formally defined in (1) and (2). Using the state-value function Vπ(s), the agent can calculate the expected return of being in state s and following policy π. Using the action-value function Qπ(s, a), the agent can determine the value of being in state s, taking action a, and subsequently following policy π. The actions that produce the highest long-term reward make up an optimal policy. The optimal state-value function is denoted by V*(s), and the optimal action-value function by Q*(s, a). The properties and behavior of the learning algorithms considered are described in what follows: the Monte Carlo (MC) algorithm, the Q-learning (QL) algorithm, the dynamic programming (DP) algorithm, and the temporal difference (TD) learning algorithm.
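The definitions referenced as (1) and (2) are not reproduced in this text. In the standard textbook formulation, which is assumed here rather than taken from the source, they read as follows, where the discount factor \gamma and the reward symbol r_{t} are notation introduced for this sketch:

```latex
\begin{align}
V^{\pi}(s) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s \right] \\
Q^{\pi}(s, a) &= \mathbb{E}_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right]
\end{align}
```

Under the same assumptions, the optimal functions are V^{*}(s) = \max_{\pi} V^{\pi}(s) and Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a).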
The MC technique is used to solve the blackjack problem. Because the episodes are simulated games, it is simple to arrange exploring starts that cover all possibilities. In this case, the player's total, the dealer's card, and whether the player has a usable ace are all chosen at random with equal probability. The initial policy is the one evaluated in the previous blackjack example, which sticks only on 20 or 21. The initial action-value function can be zero for all state-action pairs.
Suppose we wish to estimate v(s), the value of a state s under policy π, given a set of episodes obtained by following π and passing through s. Each occurrence of state s in an episode is called a visit to s. Since s may appear more than once in a single episode, we call its first occurrence in an episode the first visit to s. The first-visit MC method estimates v(s) as the average of the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s. These two MC methods are very similar but have slightly different theoretical properties. An important property of MC methods is that the estimates for each state are independent: unlike in DP, the estimate for one state does not build on the estimate of any other state [1]-[7].
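The difference between the first-visit and every-visit estimates can be illustrated with a short sketch. The function name `mc_prediction`, the episode format (a list of state and reward pairs), and the discount parameter `gamma` are assumptions made for this example and do not come from the source.

```python
from collections import defaultdict


def mc_prediction(episodes, gamma=1.0, first_visit=True):
    """Estimate state values by averaging the returns observed after visits.

    `episodes` is a list of episodes, each a list of (state, reward) pairs,
    where `reward` is the reward received after leaving `state`.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Compute the return following each time step, working backward.
        g = 0.0
        returns = []
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns.append((state, g))
        returns.reverse()
        seen = set()
        for state, g in returns:
            if first_visit and state in seen:
                continue  # only the first occurrence in this episode counts
            seen.add(state)
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

With the single episode `[("A", 1.0), ("B", 0.0), ("A", 1.0)]` and no discounting, the returns following the two visits to A are 2 and 1, so the first-visit estimate for A is 2 while the every-visit estimate is 1.5.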
The all-moves-as-first heuristic updates, with the score of a random game, the means of all moves played first on their intersections with the same color as the first move, rather than updating only the mean of the first move of the random game. Symmetrically, the heuristic updates with the opposite score the means of all moves played first on their intersections with a different color from the first move. Overall, this heuristic updates the mean of nearly every move as though it were the first move of the random game. The heuristic is not entirely accurate, since it may update two moves with the same score even if they have distinct effects depending on when they are played. Since TD errors in RL depend on estimates of the value function, which change dynamically, they are evidently quite noisy.
Furthermore, in this problem the policy is also changing, so one would not expect the TD errors to be stationary. The policy evaluation problem for action values is to estimate q(s, a), the expected return when starting in state s, taking action a, and thereafter following policy π. The MC methods for this are essentially the same as those just presented for state values, except that we now speak of visits to a state-action pair rather than to a state. A state-action pair (s, a) is said to be visited in an episode if the state s is visited and action a is taken in it. The every-visit MC method estimates the value of a state-action pair as the average of the returns that followed all visits to it. The first-visit MC method averages the returns following the first time in each episode that the state was visited and the action was selected. As before, these methods converge quadratically to the true expected values as the number of visits to each state-action pair grows. The agent is restricted to searching for optimal actions via updates. In other words, the agent may find an optimal action only from trajectories that match the current state. This contrasts with QL, where the agent may draw on any state that matches the current one rather than being restricted to states that arose from identical trajectories. The QL algorithm is a TD method that directly approximates Q*(s, a). In TD approaches such as QL, the estimates of Q*(s, a) are updated at every step using incremental algorithms.
The QL update is given in (3). Because learning can occur while playing, the QL algorithm is well suited to approximating the best blackjack strategy, which makes it an excellent option for the blackjack problem domain. Blackjack is formulated as an episodic task, with the end of each hand marking the end of a single episode. The state representation includes the agent's current point total, the dealer's face-up card value, whether the hand is soft, and whether the hand can be split [8]-[16].
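Equation (3) is not reproduced in this text. The sketch below shows the standard tabular Q-learning update, Q(s, a) ← Q(s, a) + α[r + γ max Q(s', a') − Q(s, a)], which is assumed here to be what (3) describes. The step size `alpha`, the discount `gamma`, the function name, and the dictionary layout are all hypothetical choices for illustration.

```python
from collections import defaultdict

ACTIONS = ("hit", "stick")  # assumed action set for this sketch


def q_learning_update(q, state, action, reward, next_state, done,
                      alpha=0.1, gamma=1.0):
    """Apply one tabular Q-learning step in place and return the TD error."""
    # The target bootstraps from the best action in the next state,
    # or is just the reward if the episode has ended.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # all action values start at zero
```

A state could be any hashable value, for example a tuple of the agent's total, the dealer's face-up card, and a soft-hand flag. Because the update uses only the current transition, it can be applied after every action while hands are being played.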
DP is a class of techniques that solves large, complicated problems by combining the solutions to sub-problems. In planning, DP approaches can solve a known MDP by identifying the optimal value function and its accompanying optimal policy. A fundamental tenet of optimal control is Bellman's principle, which states that an optimal policy has the property that, whatever the previous decisions were, the remaining decisions must be optimal with respect to the state resulting from those decisions. The resulting condition is often referred to as the discrete-time Hamilton-Jacobi-Bellman equation or the Bellman optimality equation. The optimal policy is then given by (4).
Bellman's principle in (5) yields a backward-in-time procedure for solving the optimal control problem, since one has to know the optimal policy at time k + 1 to derive the optimal policy at time k. It is the foundation of the DP techniques widely used in operations research, control system theory, and other fields. These are offline planning techniques.
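The backward-in-time character of DP can be illustrated with a small finite-horizon sketch. It assumes a known model supplied through two hypothetical callables, `transition(s, a)` and `reward(s, a)`, and is not taken from the source.

```python
def backward_induction(states, actions, transition, reward, horizon):
    """Finite-horizon DP: compute optimal values and policy from the last step back.

    transition(s, a) returns a list of (probability, next_state) pairs and
    reward(s, a) returns the expected immediate reward (both hypothetical).
    """
    value = {s: 0.0 for s in states}  # value after the final step
    policy = []
    for _ in range(horizon):
        new_value, step_policy = {}, {}
        for s in states:
            # Bellman optimality backup: score each action using the
            # already computed optimal values of the following time step.
            scores = {
                a: reward(s, a) + sum(p * value[s2] for p, s2 in transition(s, a))
                for a in actions
            }
            best = max(scores, key=scores.get)
            new_value[s], step_policy[s] = scores[best], best
        value = new_value
        policy.insert(0, step_policy)  # earlier time steps go to the front
    return value, policy
```

Each pass of the outer loop needs the values of the later time step before it can choose actions for the earlier one, which is why the whole model must be available offline.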
Action-value functions can be used to store the policy implicitly. If, for any state s, the action value Q(s, a) is available for all actions, then the policy that is greedy with respect to Q is known. With this change, policies no longer need to be stored explicitly. Note, however, that moving from V to Q increases the memory needed to store the value function by a factor of |A| [17]-[26].
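Reading a policy out of a stored action-value table takes only a lookup, as the following sketch suggests. The function name and the dictionary keyed by (state, action) pairs are assumptions made for illustration.

```python
def greedy_action(q, state, actions):
    """Return the action with the largest Q(state, action); ties go to the first.

    Missing entries are treated as zero (an assumption of this sketch).
    """
    return max(actions, key=lambda a: q.get((state, a), 0.0))
```

The table holds one entry per state-action pair, so it has |A| times as many entries as a state-value table over the same states.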
TD learning is one approach to solving RL problems. TD techniques use a value function to estimate the expected long-term return of taking a particular action in a state. The TD method is an online learning approach: using (6), the agent updates the state value during a trajectory or episode, rather than waiting until the game is finished and then updating sequentially over the explored trajectory.
In TD models, the value function is trained by computing a value-prediction error signal each time the agent changes state. This error, denoted by "d", is the difference between the estimated value and the value observed during the transition, which includes the immediate reward. The previous state's value estimate can then be adjusted based on d to be closer to the observed value. The "d" signal appears in response to unexpected rewards, shifts with learning from rewards to the events that predict them, and changes in response to variations in predicted reward. The aim of RL is to find the best possible policy, one that maximizes value across all possible states. The methods employed can build such a policy iteratively from data. For a policy to be evaluated by on-policy algorithms, it must be mostly greedy with respect to the value function estimates [27]-[35].
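The value-prediction error d and the resulting adjustment of the previous state's value can be sketched with the standard TD(0) rule, V(s) ← V(s) + α[r + γV(s') − V(s)]. This form is assumed here, since equation (6) is not reproduced in this text, and the parameter names `alpha` and `gamma` are hypothetical.

```python
def td0_update(v, state, reward, next_state, done, alpha=0.1, gamma=1.0):
    """Move V(state) toward reward + gamma * V(next_state); return the error d."""
    # Terminal transitions have no successor value to bootstrap from.
    next_value = 0.0 if done else v.get(next_state, 0.0)
    d = reward + gamma * next_value - v.get(state, 0.0)
    v[state] = v.get(state, 0.0) + alpha * d
    return d
```

When the observed value matches the estimate, d is zero and the estimate is left unchanged; an unexpected reward produces a nonzero d and shifts the estimate toward it.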

EXPERIMENTS AND EVALUATION
Hit and stick actions are included, and every feasible effort is made to preserve the integrity of the game. The reward system is as follows. Every action that does not lead to a terminal state receives a reward of zero. When a terminal state is reached, the agent is rewarded according to the game's outcome. For instance, if the agent wins the hand and gains $1, it receives a reward of +1; if the hand is lost, it receives a reward of -1. Because the state representation cannot foresee which cards may appear in subsequent hands, a static betting approach is employed.
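The reward scheme described above might be expressed as in the sketch below. The function name and signature are hypothetical, and the zero reward for a draw is an assumption, since the text specifies only the rewards for wins and losses.

```python
def hand_reward(player_total, dealer_total, terminal):
    """Return 0 for non-terminal steps, else +1, 0, or -1 for win, draw, loss."""
    if not terminal:
        return 0  # no reward until the hand is over
    if player_total > 21:
        return -1  # the player busts and loses the unit bet
    if dealer_total > 21 or player_total > dealer_total:
        return 1
    if player_total == dealer_total:
        return 0  # assumption: a draw pays nothing
    return -1
```

Because the bet is fixed at one unit, the reward depends only on the outcome of the hand.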
The performance of the learning agent was evaluated against two hardcoded players (dealer and player). The first player's actions are entirely random, whereas the second player employs basic strategy. Basic strategy, the best course of action for the chosen state representation, reduces the casino edge to less than one percent. The experiment consists of several runs, each made up of a loop of training and test hands. During training, the agent interacts with the environment states and uses the same lookup tables. During the test hands, the agent competes against the player that uses basic strategy. After performing a series of first random trials, every trial places the opening card. The number of test hands stays fixed, while the number of training hands is gradually increased with each run.

RESULTS AND DISCUSSION
Both mean-variance optimization in MDPs and work on robust MDPs focus on increasing the safety of policies. However, the mean-variance method proposes a combined criterion that considers both the mean and the variance of the value function, which is much more expensive to optimize. Furthermore, because the value function is not learned separately in that work, optimal values and policies cannot be retrieved even if they are needed later. Here, controllability influences only the choice of action; the optimal values are unaffected by controllability values, which preserves the correctness of the value function.
The initial policy of the RL agent is entirely random, so it will likely produce results similar to those of the random player. Because the QL method directly approximates Q*(s, a), the learning agent's strategy should approach basic strategy as the number of training hands rises. Making money at blackjack is difficult even with the best strategy. A comparison of the two shows that the learning agent significantly outperforms the random player, so the agent is clearly picking up relevant knowledge during the iterations. The performance of the learning agent improves and asymptotically approaches that of a player using basic strategy, as expected given that the QL process directly approximates the optimal action-value function Q*(s, a).
Figure 1 illustrates the strategy the learner adopts during the game, that is, whether to hit or stick for each card. Learning improves with a higher number of iterations. The value function differs with and without the ace, as shown in Figure 2, where the dealer and player values are plotted against the state value. A snapshot of the state-action space of the implementation carried out in this work is also shown in Figure 1.

This investigation involved an agent learning to model a finite MC decision chain. A natural extension is to consider an infinite-sized deck, so that the agent would attempt to model a Markov chain-MC-based RL; in that case, we expect the agent to find the stationary per-game score value, which appears to be converging. Including additional blackjack actions such as split and double would make the application more attractive: the action space would become A = {hit, stick, split, double}, moving the agent's policy closer to real-life play, but the size of the state space would also increase exponentially.

This investigation found that training an agent to play a simplified version of blackjack produces desirable results; the agent can find a policy π that achieves an average win, draw, and loss rate of 38:26 across the different policy iteration methods. The best blackjack strategy employing the MC method was investigated in this study using RL. It has been shown that the learning agent outperformed the random player and converged to a nearly optimal policy. Although the results are positive, there is still room for improvement. The outcomes could be improved with a better exploration technique such as the Bayesian MC learning algorithm. Additionally, a more robust state representation combined with a dynamic betting approach may yield a policy that outperforms basic strategy. The effect on the value function of the rule change related to including or not including the ace is also studied.

Figure 1. Strategy outcome and state space snapshot
Figure 2. Value function with/without ace