Reinforcement Learning: Introduction III

Introduction

This is the third part of a four-part series on reinforcement learning. The main focus of this part is solving the Bellman equation.

Policy evaluation, value iteration

As a recap, the Bellman equation:

\begin{equation} v_\pi(s) = R(a,s) + \gamma \sum_{a, s'}{\pi(a|s) T(s',a,s) v_\pi(s')}. \end{equation}

If the solution ($v_\pi(s)$) were known, the agent would know how valuable its current state is and where it should go next. However, the state-values depend on the policy ($\pi$), so they are only valid if the agent follows that policy; a different policy can yield different values. Fortunately, if an improved policy $\pi'$ (slightly different from $\pi$) always chooses the action leading to the better resulting state-value, then this new policy yields better (or at least as good) state-values, meaning:

\begin{equation} (\forall s \in S): v_{\pi'}(s) \geq v_\pi(s). \end{equation}

This is the so-called policy improvement theorem. The final goal is to find the optimal policy, which gives the highest possible expected utility in every state. For a fixed policy the Bellman equation is a simple system of linear equations that can be solved with standard linear-algebra methods; this step is called policy evaluation and it yields the state-values of that policy. By the policy improvement theorem, choosing the best action in each state then enhances the policy, and a new policy evaluation gives the new state-values. This iterative process is referred to as policy iteration. The so-called value iteration fuses these two steps into one by modifying the Bellman equation (for deterministic policies); the modified equation is given right after the following sketch of plain policy evaluation.
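As a hedged illustration, the policy-evaluation step can be written as a direct linear solve with NumPy. The arrays T, R and pi below are hypothetical placeholders (a state-only reward R(s) is used, matching the later formulas), not something defined in the original series:

import numpy as np

# Hypothetical small MDP:
#   T[s_next, a, s] = probability of landing in s_next after taking a in s
#   R[s]            = immediate reward in state s
#   pi[a, s]        = probability that the policy picks a in s
def policy_evaluation(T, R, pi, gamma=0.9):
    n_states = R.shape[0]
    # Transition matrix of the Markov chain induced by the policy:
    # P_pi[s, s_next] = sum_a pi[a, s] * T[s_next, a, s]
    P_pi = np.einsum('as,nas->sn', pi, T)
    # Solve the linear system (I - gamma * P_pi) v = R for the state-values.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)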

\begin{equation} v(s) = R(s) + \gamma \max_a \left(\sum_{s'}{T(s',a,s) v(s')}\right). \end{equation}

After this modification the equation is no longer linear. It can be shown that the solution can be approximated iteratively by successive approximation, i.e. the following update rule is applied consecutively until sufficiently accurate results are achieved:

\begin{equation} v_{t+1}(s) = R(s) + \gamma \max_a \left(\sum_{s'}{T(s',a,s) v_t(s')}\right). \end{equation}

The state-values ($v_t(s)$) get closer and closer to the optimal solution step by step. The corresponding (near-optimal) policy is then given by:

\begin{equation} \pi(s) = argmax_a \left( \sum_{s'}{T(s',a,s)v(s')} \right). \end{equation}
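A minimal sketch of value iteration and the final policy extraction, reusing the hypothetical T and R arrays from the earlier sketch (this is an illustration under those assumptions, not the only possible implementation):

import numpy as np

def value_iteration(T, R, gamma=0.9, tol=1e-8):
    n_next, n_actions, n_states = T.shape
    v = np.zeros(n_states)
    while True:
        # expected[a, s] = sum_{s'} T[s', a, s] * v[s']
        expected = np.einsum('nas,n->as', T, v)
        v_new = R + gamma * expected.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:   # stop when the update barely changes v
            break
        v = v_new
    # Greedy policy: pi(s) = argmax_a sum_{s'} T(s',a,s) v(s')
    policy = np.einsum('nas,n->as', T, v_new).argmax(axis=0)
    return v_new, policy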

Similar equations are true for action-values:

\begin{equation} Q_{t+1}(s,a) = R(s) + \gamma \sum_{s'}{T(s',a,s) \max_{a'} Q_t(s',a')} \end{equation}

and

\begin{equation} \pi(s) = argmax_a Q(s,a). \end{equation}
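The action-value version can be sketched the same way (same hypothetical arrays as above; the fixed iteration count is an arbitrary choice for the example):

import numpy as np

def q_value_iteration(T, R, gamma=0.9, n_iter=1000):
    n_next, n_actions, n_states = T.shape
    Q = np.zeros((n_actions, n_states))
    for _ in range(n_iter):
        v = Q.max(axis=0)                          # max_{a'} Q(s', a') for every state
        Q = R + gamma * np.einsum('nas,n->as', T, v)
    return Q, Q.argmax(axis=0)                     # action-values and the greedy policy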

Monte Carlo method

The previous section showed methods to solve the Bellman equation. Unfortunately, the transition matrix ($T(s',a,s)$) is usually unknown and very difficult to determine; this is known as the curse of modeling. To tackle the problem it is tempting to get rid of the model somehow, which is possible with Monte Carlo sampling. The samples are decision sequences (action, state, reward, new state, etc.) produced by the agent while following its policy, starting from a state (the starting state is randomly initialized each time). The return (utility) of such a sequence is then known exactly and is an approximation of the state-value of the starting state. It can be proved that after sampling many returns ($G(s)$) for a state (known from the decision sequence), their average will be close to the expected utility (state-value) of that state. Furthermore, the estimate can be updated incrementally:

\begin{equation} v^{\pi}(s) \leftarrow v^{\pi}(s) + \alpha(t) \left( G(s) - v^{\pi}(s) \right). \end{equation}

where the step size $\alpha(t)$ has to satisfy three conditions (a decaying schedule such as $\alpha(t) = 1/t$ satisfies all of them):

\begin{equation} \lim_{t \rightarrow \infty} \alpha(t) = 0;\ \sum_{t=1}^\infty{\alpha(t)} = +\infty;\ \sum_{t=1}^\infty{\alpha^2(t)} < +\infty. \end{equation}

The Monte Carlo method keeps drawing samples by running episode after episode, starting from a state and waiting until the end. It then computes the return for every state visited along the way and updates the state-value estimates with the formula above. A sufficient number of samples provides arbitrarily accurate results; the problem is that this technique tends to be very slow.
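A minimal sketch of this sampling loop, assuming a hypothetical episodic environment with reset()/step() methods and a policy function (these names are placeholders, not part of the original post). A constant step size is used for brevity, whereas a decaying $\alpha(t)$ such as $1/t$ would satisfy the conditions above:

from collections import defaultdict

def mc_evaluation(env, policy, n_episodes=10000, gamma=0.9, alpha=0.05):
    v = defaultdict(float)                     # state-value estimates, default 0
    for _ in range(n_episodes):
        s = env.reset()                        # random starting state
        trajectory = []
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            trajectory.append((s, r))
            s = s_next
        g = 0.0                                # accumulate the return backwards
        for s, r in reversed(trajectory):
            g = r + gamma * g
            v[s] += alpha * (g - v[s])         # v(s) <- v(s) + alpha (G(s) - v(s))
    return v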

Temporal difference learning (TD-learning)

In order to speed up Monte Carlo-based policy evaluation, the return is approximated using the currently known value estimates. The following update rule is the key of TD-learning:

\begin{equation} v^{\pi}(s) \leftarrow v^{\pi}(s) + \alpha(t) \left( R(s) + \gamma v^{\pi}(s') - v^{\pi}(s) \right). \end{equation}

Note that the update is done for only one state at a time. After that, the policy can be improved by the following formula:

\begin{equation} \pi(s) = argmax_a \left( \sum_{s'}{T(s',a,s)v(s')} \right). \end{equation}

In the case of action-values, however:

\begin{equation} \pi(s) = argmax_a Q(s,a). \end{equation}

The latter does not depend on the transition probabilities and provides the basis for so-called model-free learning.
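A sketch of the one-step TD(0) evaluation loop for state-values, under the same hypothetical environment interface as in the Monte Carlo example:

from collections import defaultdict

def td0_evaluation(env, policy, n_episodes=10000, gamma=0.9, alpha=0.05):
    v = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # Bootstrap: use the current estimate of the next state instead of the full return.
            target = r if done else r + gamma * v[s_next]
            v[s] += alpha * (target - v[s])
            s = s_next
    return v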

Q-learning, Sarsa-learning

Q-learning uses the action-value function and the following update rule:

\begin{equation} Q^{t+1}(s,a) = (1-\alpha)Q^t(s,a) + \alpha \left( R(s) + \gamma \max_{a'} Q^t(s',a') \right). \end{equation}

Here $s'$ is the next state after $s$. Interestingly, the next action can be chosen in any way and convergence to the optimal values will still occur. That is why Q-learning was a breakthrough and has become one of the most popular RL algorithms. The pseudo code of Q-learning:

INIT:
initialize Q(s,a) for all s, a
initialize alpha, gamma
choose starting state s <- s0
choose starting action a <- a0
set I to 1

REPEAT: repeat until convergence OR I > maxiter
choose action a in s from Q(s,a)
  (e.g. epsilon-greedy)
take action a
observe reward r and new state s'
update: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))
s <- s' and I <- I + 1
check convergence

Epsilon-greedy means that with a small probability the action is chosen randomly instead of $argmax_a Q(s,a)$. Sarsa-learning got its name from the decision sequence (s, a, r, s', a'). Its update rule and pseudo code are given below.

\begin{equation} Q^{t+1}(s, a) = (1-\alpha)Q^t(s, a) + \alpha (r + \gamma \cdot Q^t(s', a')) \end{equation}

The pseudo code of the algorithm:

INIT:
initialize Q(s,a) for all s, a
initialize alpha, gamma
choose starting state s <- s0
choose starting action a <- a0
set I to 1

REPEAT: repeat until convergence OR I > maxiter
take action a
observe reward r and new state s'
choose action a' according to the policy
update: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*Q(s',a'))
s <- s' and I <- I + 1 and a <- a'
check convergence
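The same Python sketch adapted to Sarsa; the only substantial change is that the bootstrap uses the action $a'$ actually chosen by the policy instead of the maximum (same hypothetical interface as before):

import random
from collections import defaultdict

def sarsa(env, actions, n_episodes=10000, alpha=0.1, gamma=0.9, epsilon=0.1):
    def eps_greedy(Q, s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda x: Q[(s, x)])

    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(Q, s)                      # initial action from the learned policy
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(Q, s_next)        # next action from the same policy (on-policy)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            # Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma Q(s',a'))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s, a = s_next, a_next
    return Q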

The main difference between the two is that Sarsa is so-called on-policy learning while Q-learning is off-policy learning. To explain this, it is important to understand that once the problem became model-free, the actions took on two roles: (1) exploring the environment and (2) gathering rewards. These work against each other, and deciding which one to emphasize is a trade-off, the so-called exploration-exploitation dilemma. The optimal policy, which is the goal to determine, is also referred to as the target policy. The policy that chooses $a'$ in the previous update equations plays the role of the target policy, because at the end it will be the optimal one. During learning the environment is sampled, and this can be done with the same policy as the current target policy or with a different one, the so-called behaviour policy. This is the policy that chooses the next action which is actually executed in the environment. In Q-learning the target and the behaviour policy are different; in Sarsa they are the same. Therefore Q-learning is able to decouple exploration and exploitation.

Approximations (DNNs)

The previous sections introduced methods which are based on a critical assumption: each value for each state can be stored in memory, like a table. Unfortunately, in real-life applications the size of the state space (or even the action space) is so huge that it is impossible to handle it as a table. Therefore generalization (approximation) over the state space is necessary. This can be done with linear approximators or with non-linear approaches like neural networks.
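As a hedged illustration of the linear case, one semi-gradient Q-learning step could look like the sketch below; the feature vectors phi(s,a) and the weight vector w are hypothetical stand-ins for whatever representation is chosen:

import numpy as np

def linear_q_step(w, phi_sa, r, phi_next_best, alpha=0.01, gamma=0.9):
    # Linear approximation: Q(s, a) is approximated by the dot product w . phi(s, a)
    q_sa = w @ phi_sa
    q_next = w @ phi_next_best            # value of the greedy action in the next state
    td_error = r + gamma * q_next - q_sa
    # Semi-gradient update: the gradient of w . phi with respect to w is just phi
    return w + alpha * td_error * phi_sa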

In the next part a case study will be introduced and it will show how non-linear approximators work in a concrete example.

Example RL algorithms can be found in my GitHub repository.
