Reinforcement Learning: Introduction II

Introduction

This is the second part of a four-part series on reinforcement learning. This part covers the core mathematical concepts: MDPs, value functions and the Bellman equation.

MDP

MDP stands for Markov Decision Process; this section gives a short overview. Reinforcement learning is one of the most well-grounded branches of ML thanks to MDPs: most reinforcement learning tasks can be described as an MDP, at least approximately.

Let’s consider a sequence of consecutive states generated by an agent. To describe this sequence mathematically, the states are treated as random variables. A natural question is then: what is the probability of being in a given state if all the previous states are known? This can be formalized as follows:

\begin{equation} P_{\pi}(s_{t+1}) = P(X_{t+1} = s_{t+1}| X_{t} = s_{t}, \dots, X_0 = s_0) \end{equation}

$\pi$ refers to the current on-going process, $X_{t}$ is the random variable of the state at time step $t$, and $s_t$ is the concrete state with $s_t \in S$, where $S$ is the set of possible states. In the general case it is difficult to derive further information about such a sequence mathematically. However, a special case is much simpler. Suppose that the next state depends only on the current state and not on the previous ones:

\begin{equation} P(X_{t+1} = s_{t+1}| X_{t} = s_{t}, \dots, X_0 = s_0) = P(X_{t+1} = s_{t+1}| X_{t} = s_{t}) \end{equation}

This is called the Markov property. A sequence $X := \{X_0, X_1, \dots, X_t\}$ that satisfies the Markov property is a so-called Markov chain. A classical example is the random walk in one dimension: starting from a fixed origin, an object can move one step up or one step down, choosing between the two directions stochastically with equal probability. The random variable is the height of the object, and the sequence of heights forms a Markov chain.
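A minimal sketch of that random walk in Python (the function name and step count are arbitrary choices of mine):

```python
import random

def random_walk(num_steps, start=0):
    """One trajectory of the 1-D random walk: a simple Markov chain.

    The next height is built from the current height only, so the
    Markov property holds by construction.
    """
    heights = [start]
    for _ in range(num_steps):
        step = random.choice([1, -1])       # up or down with equal probability
        heights.append(heights[-1] + step)  # depends only on the current state
    return heights

print(random_walk(10))
```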

The connection to RL is the following. In RL everything revolves around one or more intelligent agents that try to gather as much reward as possible. To achieve this, the agent has to make a decision in each state and then take an action. If the resulting sequence of states is a Markov chain, then this process is a Markov Decision Process, or MDP for short.
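To make the later sections concrete, here is a tiny, completely invented MDP written as plain Python dictionaries (the names STATES, ACTIONS, T, R and GAMMA are placeholders of mine); the code sketches further below reuse it:

```python
# A tiny, made-up MDP used by the sketches below.
# T[s][a][s'] is the probability of landing in s' after taking action a in
# state s, and R[s] is a reward that depends only on the current state.
STATES = [0, 1, 2]
ACTIONS = ["left", "right"]
R = {0: 0.0, 1: 0.0, 2: 1.0}
T = {
    0: {"left": {0: 1.0},          "right": {0: 0.2, 1: 0.8}},
    1: {"left": {0: 0.8, 1: 0.2},  "right": {1: 0.2, 2: 0.8}},
    2: {"left": {1: 0.8, 2: 0.2},  "right": {2: 1.0}},
}
GAMMA = 0.9  # discount factor
```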

Value-function

In order to make the right decisions, the agent needs some way of knowing which action is the most worth taking. In reinforcement learning there are two approaches that address this need:

  • state-value functions
  • action-value functions.

Before the definitions, let’s first say a few words about the utility of state sequences. If the agent knew the utility of every sequence starting from its current state, the next action could be chosen by following the Maximum Expected Utility (MEU) principle. In the case of sequential decisions the utility function is required to be separable. This means that for some function $f$ the following holds:

\begin{equation} U([s_i \dots s_T]) = f(s_i,U([s_{i+1} \dots s_T])). \end{equation}

Basically, there are two common choices for $f$: the first is additive, the second is based on discounting.

\begin{equation} U([s_i \dots s_T]) = \sum_{j=i}^T{R_j} \Rightarrow U([s_i \dots s_T]) = R_i + U([s_{i+1} \dots s_T]) \end{equation}

\begin{equation} U([s_i \dots s_T]) = \sum_{j=i}^T{\gamma^{j-i} \cdot R_j} \Rightarrow U([s_i \dots s_T]) = R_i + \gamma \cdot U([s_{i+1} \dots s_T]) \end{equation}
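The separability recursion of the discounted case is easy to verify numerically; a small sketch with an arbitrary reward sequence and an arbitrary $\gamma$:

```python
def additive_return(rewards):
    """U([s_i ... s_T]) in the additive case: the plain sum of the rewards."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.9):
    """U([s_i ... s_T]) in the discounted case: sum of gamma^(j-i) * R_j."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def discounted_return_recursive(rewards, gamma=0.9):
    """The same quantity via the recursion U_i = R_i + gamma * U_{i+1}."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_return_recursive(rewards[1:], gamma)

rewards = [1.0, 0.0, 2.0, 5.0]  # arbitrary reward sequence
assert abs(discounted_return(rewards) - discounted_return_recursive(rewards)) < 1e-12
```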

The last two expressions for the utility of a state sequence are the basis for defining the value functions. Here $R_j$ is the reward received at step $j$ and $\gamma$ is the discount factor. The core of the definitions is the MEU principle.

Definition 1: (state-value)

Additive:

\begin{equation} v_\pi(s) = E_\pi \left[ \sum_{j=0}^T{R_j} | s_0 = s \right] \end{equation}

or the discounted version:

\begin{equation} v_\pi(s) = E_\pi \left[ \sum_{j=0}^T{\gamma^{j} \cdot R_j} | s_0 = s \right]. \end{equation}

The subscript $\pi$ refers to the policy of the agent. Remember, the policy is a mapping from states to (distributions over) actions: $\pi(a|s)$. The value function gives the expected utility (the sum of rewards in the additive case) of the sequences generated by the policy when starting from state $s$. When the policy is stochastic, different sequences are possible.
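For a stochastic policy, $v_\pi(s)$ can be approximated simply by averaging sampled returns. Below is a rough Monte Carlo sketch for the toy MDP defined earlier; the policy, the horizon of 200 steps (which truncates the infinite discounted sum) and the episode count are arbitrary choices of mine:

```python
import random

def policy(s):
    """A fixed stochastic policy pi(a|s): 'right' with probability 0.8."""
    return "right" if random.random() < 0.8 else "left"

def sample_next_state(s, a):
    """Draw s' according to p(s'|s, a) from the toy transition table T."""
    dist = T[s][a]
    return random.choices(list(dist), weights=list(dist.values()))[0]

def rollout(s, horizon=200):
    """Discounted return of one trajectory started in s and following pi."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        ret += discount * R[s]
        s = sample_next_state(s, policy(s))
        discount *= GAMMA
    return ret

def mc_state_value(s, episodes=5000):
    """Monte Carlo estimate of v_pi(s): average return over many rollouts."""
    return sum(rollout(s) for _ in range(episodes)) / episodes

print({s: round(mc_state_value(s), 2) for s in STATES})
```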

The action-value is analogous, but it gives the expected utility of the sequences generated by the policy when the agent starts from $s$ and its first action is $a$.

Definition 2: (action-value)

Additive:

\begin{equation} q_\pi(s,a) = E_\pi \left[ \sum_{j=0}^T{R_j} | s_0 = s, a_0 = a \right] \end{equation}

or the discounted version:

\begin{equation} q_\pi(s,a) = E_\pi \left[ \sum_{j=0}^T{\gamma^{j} \cdot R_j} | s_0 = s, a_0 = a \right]. \end{equation}
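A minimal sketch for the action-value, reusing the toy MDP and the helpers from the previous snippet: the first action is forced to be $a$, after which the policy takes over.

```python
def mc_action_value(s, a, episodes=5000):
    """Monte Carlo estimate of q_pi(s, a): fix the first action to a,
    then follow the policy pi for the rest of the trajectory."""
    total = 0.0
    for _ in range(episodes):
        s2 = sample_next_state(s, a)
        total += R[s] + GAMMA * rollout(s2)
    return total / episodes

print({a: round(mc_action_value(0, a), 2) for a in ACTIONS})
```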

Interesting and important connections can be found when the process is an MDP.

Bellman-equation

Let’s suppose that the sequence of states is a Markov chain for any possible policy. Then, thanks to the Markov property, the transition between two consecutive states can be expressed by a transition probability function $T$:

\begin{equation} T(s_{i+1},a_i,s_i) = p(s_{i+1}|s_i,a_i) = p(s_{i+1}|s_i,a_i, \dots, s_0, a_0). \end{equation}
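In code, the transition function of the toy MDP above is just the lookup T[s][a][s'] = p(s'|s,a); a quick sanity check that every conditional distribution sums to one:

```python
for s, actions in T.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-12, (s, a)
```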

After that, a dynamic programming formula can be derived in the following way:

\begin{equation} v_\pi(s) = E_\pi \left[ \sum_{k=0}^\infty{\gamma^{k} R_{k+t}}| s_t = s \right] = \sum_a{\sum_{s'}{\pi(a|s)T(s',a,s) \left(R_t + \gamma E_\pi\left[ \sum_{k=0}^\infty{\gamma^{k} R_{k+t+1}}| s_{t+1} = s' \right] \right)}}. \end{equation}

The reward depends only on the current state $s$, so we can write $R_t = R(s)$ and pull it out of the double sum; the remaining weights sum to one, since $\sum_a \sum_{s'} \pi(a|s) T(s',a,s) = 1$. This yields the following formula:

\begin{equation} v_\pi(s) = R(s) + \gamma \sum_{a, s'}{\pi(a|s) T(s',a,s) v_\pi(s')}. \end{equation}

This is the so-called Bellman equation. The Bellman equation is the key to finding the best policy for a given learning problem. If the policy is fixed, the value function can be determined by solving the Bellman equation. The agent then knows the value of its current state and can modify its policy to achieve better performance. For the new policy a new value function satisfies the Bellman equation, and this process can be repeated until the optimal policy is reached. The next part will show this in detail.
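As a rough illustration of "solving the Bellman equation for a fixed policy", here is a simple fixed-point iteration on the toy MDP from before (the tolerance and the policy probabilities are arbitrary); since the Bellman equation is linear in $v_\pi$, it could equally be solved directly as a linear system:

```python
def policy_evaluation(pi_probs, tol=1e-10):
    """Solve the Bellman equation for a fixed policy by fixed-point iteration:
    v(s) <- R(s) + gamma * sum_a sum_s' pi(a|s) T(s',a,s) v(s')."""
    v = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            expected = sum(
                pi_probs[s][a] * p * v[s2]
                for a in ACTIONS
                for s2, p in T[s][a].items()
            )
            new_v = R[s] + GAMMA * expected
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < tol:
            return v

# The same stochastic policy as in the Monte Carlo sketches: 'right' with prob. 0.8.
pi_probs = {s: {"left": 0.2, "right": 0.8} for s in STATES}
print({s: round(val, 2) for s, val in policy_evaluation(pi_probs).items()})
```

Up to sampling noise and horizon truncation, the result should agree with the Monte Carlo estimates from the state-value sketch, since both target the same $v_\pi$.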

Summary

The Bellman equation is the key to solving a reinforcement learning problem and obtaining a good policy. To arrive at this equation it was assumed that the dynamics of the environment can be described as a Markov chain, and also that the utility function is separable (this enabled the last step of the derivation).

In short:

  • An MDP is a decision sequence in which the sequence of states satisfies the Markov property.
  • The value function has two forms: the state-value function and the action-value function. It gives the expected return (utility) gathered along the path that starts from the current state and follows the policy.
  • There is a dynamic programming relation between consecutive states, which is formalized by the Bellman equation.
  • The Bellman equation is the cornerstone of solving a reinforcement learning task.

Example RL algorithms can be found in my github repository.

View on GitHub