Introduction
This is the first part of a four-part series about reinforcement learning (hereafter RL). First, I will introduce the different types of machine learning. Then the focus will shift toward RL: I will explain the most important terms and show some examples. The goal of this part is to get familiar with the core concepts of RL.
Types of Machine Learning
[Figure: Different learning models]
The picture above shows the different types of machine learning:
- supervised learning
- unsupervised learning
- reinforcement learning
Supervised learning uses input data together with output data that contains the correct answers. The goal of supervised learning is to find the connection between the given inputs and outputs and to create a model which can generalize this connection; the outputs for new inputs can then be determined by the learnt model. Supervised learning has basically two types: classification and regression. In classification the output is a label which represents the class the input belongs to. For example, the inputs can be images of handwritten numbers and the outputs are the concrete numbers the images show. In regression the output is a continuous value. For instance, the input can be the properties of a house - size, location, age etc. - and the output is the price of the house.
The learning is an iterative procedure based on a so-called loss function, which depends on the differences between the correct solutions and the predictions of the model. The model is corrected during the iterations according to the losses.
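To make this loss-driven iteration concrete, here is a minimal sketch of my own (not taken from any particular library or course): a one-dimensional linear regression fitted with plain gradient descent. The house sizes, prices, learning rate and number of iterations are all made up for illustration.

```python
import numpy as np

# Toy regression data: house size (m^2) vs. price, both made up for illustration.
sizes = np.array([50.0, 70.0, 90.0, 120.0, 150.0])
prices = np.array([110.0, 150.0, 185.0, 240.0, 300.0])

w, b = 0.0, 0.0          # parameters of the model: price = w * size + b
learning_rate = 1e-5

for step in range(10000):
    predictions = w * sizes + b
    errors = predictions - prices          # difference between prediction and correct answer
    loss = np.mean(errors ** 2)            # mean squared error loss
    # Gradients of the loss with respect to the parameters.
    grad_w = 2 * np.mean(errors * sizes)
    grad_b = 2 * np.mean(errors)
    # Correct the model according to the loss (one gradient descent step).
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"learned model: price = {w:.2f} * size + {b:.2f}, final loss = {loss:.2f}")
```

The loop is exactly the iterative procedure described above: predict, measure the loss, correct the model, repeat.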
Unsupervised learning uses only the input data; the correct solutions are not known during learning. The goal is to find patterns and inner structures in the data. Most of the time this means finding clusters: chunks of data which are closely related to each other in some way. This is the so-called clustering. The importance of unsupervised learning is that most of the available data is unlabelled.
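As a small illustration of clustering (my own sketch, assuming NumPy and scikit-learn are installed), the snippet below groups unlabelled 2-D points into two clusters with k-means; the data is generated only for demonstration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled toy data: two loose groups of 2-D points (made up for illustration).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
data = np.vstack([group_a, group_b])

# k-means only sees the inputs; it discovers the two clusters on its own.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print("cluster centers:\n", kmeans.cluster_centers_)
print("labels of the first five points:", kmeans.labels_[:5])
```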
Reinforcement learning provides a much more general learning model. This learning model is very similar to the way animals and humans learn: it makes it possible to interact with the environment and collect further information on the fly. The subsequent sections cover this in detail.
RL basics
A typical RL process follows this pattern. First, the agent observes the state (see later) of the environment. Second, the agent intervenes in the environment (takes some action) to change its state. Third, the environment gives feedback, the so-called reward, telling the agent how successful its action was. These three steps then keep repeating. If the task is a so-called episodic task, the process terminates in some of the states (terminal states). On the other hand, a task can be continuing, which means it never ends. An episodic task can have a finite horizon or an infinite horizon. With a finite horizon, the task finishes after a given time regardless of the current state; with an infinite horizon, there is no time limit. An example of an episodic task (with infinite horizon) is a board game where there is no limit on the number of steps. An example of a continuing task is a piece of software which tries to sustain the energy supply of a city by controlling a power plant. The following sequence shows the three steps consecutively:
\[\dots\ \rightarrow\ s_t\ \rightarrow\ a_t\ \rightarrow\ r_{t+1}\ \rightarrow\ s_{t+1}\ \rightarrow\ a_{t+1}\ \rightarrow\ r_{t+2}\ \rightarrow\ s_{t+2}\ \rightarrow\ \dots\]The goal of reinforcement learning is to find a policy (a function which helps to decide the next action in a state) which ensures the most reward when the trained agent is applied.
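To make the state → action → reward loop concrete, here is a minimal sketch of an episodic agent–environment interaction in Python. The `ToyEnvironment` class and the random policy are purely hypothetical, invented for illustration; they are not part of any particular RL library.

```python
import random

class ToyEnvironment:
    """A hypothetical 1-D corridor: the agent starts at position 0 and the
    episode terminates when it reaches position 5 (a terminal state)."""

    def reset(self):
        self.position = 0
        return self.position                      # initial state s_0

    def step(self, action):
        # The corridor is closed on the left, so the position never goes below 0.
        self.position = max(0, self.position + action)
        done = self.position >= 5                 # terminal state reached?
        reward = 1.0 if done else -0.1            # small cost per step, bonus at the goal
        return self.position, reward, done        # s_{t+1}, r_{t+1}, episode finished?

env = ToyEnvironment()
state = env.reset()                               # 1. observe the state
done = False
total_reward = 0.0

while not done:                                   # one episode of an episodic task
    action = random.choice([-1, +1])              # 2. take an action (a very simple random policy)
    state, reward, done = env.step(action)        # 3. receive the reward and the next state
    total_reward += reward

print("episode finished, total reward:", round(total_reward, 1))
```

The three numbered comments correspond to the three steps above: observe, act, receive the reward.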
The state of the environment is a description of its most important features which can be changed by the agent. These features naturally depend on the concrete problem.
The policy, from a mathematical point of view, is a mapping from the state space to the action space. This mapping has two basic forms: deterministic policy and stochastic policy. A deterministic policy has the form:
\[a = \pi(s),\ \pi:S \rightarrow A\]while a stochastic policy has the form:
\[p(a|s) = \pi(a,s),\ \pi:S\times A \rightarrow P\]Here S denotes the state space, A the action space, and P a probability distribution over A. In this sense a deterministic policy can be viewed as a special stochastic policy whose distribution assigns probability one to the chosen action and zero to all the others.
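As an illustration of the two policy forms (my own sketch, with made-up states and actions), the deterministic policy below is just a lookup table, while the stochastic policy samples an action from a per-state probability distribution:

```python
import numpy as np

actions = ["recharge", "explore"]

# Deterministic policy: a = pi(s), exactly one action per state.
deterministic_policy = {"low_battery": "recharge", "ok_battery": "explore"}

# Stochastic policy: pi(a, s) = p(a | s), a probability distribution over actions per state.
stochastic_policy = {
    "low_battery": [0.9, 0.1],   # P(recharge | low_battery), P(explore | low_battery)
    "ok_battery":  [0.2, 0.8],
}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state, rng=np.random.default_rng(0)):
    return rng.choice(actions, p=stochastic_policy[state])

print(act_deterministic("low_battery"))   # always "recharge"
print(act_stochastic("ok_battery"))       # usually "explore", sometimes "recharge"
```

Setting one of the probabilities to 1.0 and the other to 0.0 turns the stochastic policy into the deterministic one, which is exactly the special case mentioned above.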
In order to define the goal more precisely, let’s introduce the discounted return:
\begin{equation} G_t = \sum^\infty_{k=0}{\gamma^k R_{t+1+k}} \end{equation}
where $\gamma$ is the so-called discounting factor. The discounting factor was originally introduced for continuing tasks. It decreases the weight of rewards which will be received in the future, reflecting the idea that a reward in the present is more valuable than a reward in the future. In fact, for continuing tasks the so-called average reward can be used instead of the discounted return, and it is free from the additional hyperparameter ($\gamma$). In episodic tasks the discounting factor can be one, but mathematically it is better to choose a value less than one.
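A short sketch of my own (with an arbitrary reward sequence and $\gamma = 0.9$) showing how the discounted return $G_t$ is computed from a finite sequence of rewards:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+1+k} for a finite reward sequence R_{t+1}, R_{t+2}, ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 2.0, 3.0]                 # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4} (made up)
print(discounted_return(rewards, gamma=0.9))   # 1 + 0*0.9 + 2*0.81 + 3*0.729 ~ 4.807
```

Note how the later rewards contribute less and less as $\gamma^k$ shrinks, which is exactly the point of discounting.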
In the next chapter the basic mathematical concepts will be covered: the MDP, the Bellman equation and value functions.
Exemplary RL algorithms can be found in my GitHub repository.