For the introduction to background information on Reinforcement Learning, please check out the previous article here.

## Bellman Equation

MDP is a mathematical framework to model discrete-time stochastic systems. MDP consists of some States (S), Actions (A) to be taken while at each state, Transitions (T) from a state, s, to another state, s’, by taking an action with a probability T(s,a,s’). At each state, the agent who is interacting with the system, collects a reward R(s), or R(s,a), R(s,a,s’), depending on the stochasticity of the system.

A MDP is solved when the best policy, π(s) is identified for every state. Policy means the action the agent decides to take when at a state so the probability of the total reward in the first equation below is maximized as shown in the second equation below.

## Policy Iteration

Policy Iteration (PI) algorithm needs the T and R matrices. It starts with a policy to start with and updates the Value function as in Eq(1) and then finds the best policy from Eq(2) until the new and previous policies are the same. This is a straightforward iterative process.

## Value Iteration

Value Iteration algorithm

VI also needs T and R matrices. But it can bypass the need to find the policy in each iteration. It initializes a random V(s) matrix with random Bellman function value for each state. Then for each state it finds the best action and updates the V(s) until all the V(s) are stable. This way it can quickly find the best policies for each state that stabilizes the V-matrix, instead of having to explicitly find the policy and then compute V(s) as in PI. It might be faster than PI for same iteration but may often take more iterations.

## Model-free Q-Learning

Unlike PI and VI, QL is realistic that often the actual T and R matrices are not known in a real world problem. Instead an agent can only interact with the environment and try to update its impression about the system by updating its T and R matrix.

In QL, at each state, the agent takes a random action or the best action that maximizes the V(s). This is known as EXPLORATION vs EXPLOITATION controlled by a based on a parameter ε which determines whether the agent chooses to randomly explore a new action or take the best action from what is has seen**. **The agent takes random actions for probability ε and greedy action for probability (1-ε) If it does not explore in the beginning then it keeps taking the same decision based on its first action and gets stuck. So initially, more exploration is useful to make a rich experience of the environment. As the agent explores and takes random actions over and over, it can gradually become confident about its experience and may not need to explore as much. So, the parameter ε can be reduced gradually.

The Learning Rate parameter 0< helps update the previous Q(s,a) value with the new value. Lower value makes convergence slow and too high of a alpha value makes it fluctuate the V(s) a lot.

#### References:

**Bellman Equation: **https://en.m.wikipedia.org/wiki/Markov_decision_process

**Policy Iteration and Value Iteration: **Mihaela van der Schaar, Reinforcement Learning: A brief introduction

**Q-Learning**: Sargur N. Srihari, University of Buffalo